
Clinical data is some of the richest signal available for machine learning. Notes from COVID hospitalizations, for example, contain temporal progressions, comorbidity patterns, medication responses, and demographic context that no synthetic dataset can replicate. The problem is that this data is also about real people, recorded at some of the most vulnerable moments of their lives.

De-identification is the process of stripping enough identifying information that a dataset can be shared or analyzed without exposing individuals to privacy risk. In healthcare, HIPAA sets the legal floor. But HIPAA compliance and "actually private" aren't the same thing — and closing that gap while preserving analytical value is a harder problem than it looks.

This post covers the real tradeoffs involved: what you're required to remove, what you lose when you remove it, and how to make de-identification decisions that hold up when someone tries to reassemble the puzzle.

What HIPAA Actually Requires

HIPAA's Safe Harbor method defines 18 categories of identifiers that must be removed or transformed before a dataset is considered de-identified (several related categories are grouped together in the list below):

  • Names
  • Geographic data smaller than state level (street addresses, cities, ZIP codes in most cases)
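  • All elements of dates more specific than year (birth, admission, discharge, death), plus all ages over 89, which must be aggregated into a single 90-or-older category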
  • Phone numbers, fax numbers
  • Email addresses
  • Social Security numbers
  • Medical record and health plan numbers
  • Account and certificate numbers
  • Vehicle identifiers and license plates
  • Device identifiers and serial numbers
  • Web URLs and IP addresses
  • Biometric identifiers (fingerprints, voiceprints)
  • Full-face photographs
  • Any other unique identifier
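
Two of these rules are mechanical enough to sketch in a few lines: reducing dates to the year, and top-coding ages over 89 into a single category. This is a minimal illustration, not a complete Safe Harbor implementation; the record and its field names (`age`, `admitted`) are hypothetical.

```python
from datetime import date

def safe_harbor_age(age):
    """Top-code ages: anything over 89 becomes the single category '90+'."""
    return "90+" if age > 89 else str(age)

def year_only(d):
    """Drop month and day, keeping only the year."""
    return d.year

# Hypothetical record; the field names are illustrative assumptions.
record = {"age": 93, "admitted": date(2020, 4, 3)}

deidentified = {
    "age": safe_harbor_age(record["age"]),
    "admit_year": year_only(record["admitted"]),
}
print(deidentified)  # {'age': '90+', 'admit_year': 2020}
```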

The alternative to Safe Harbor is the Expert Determination method, where a qualified statistician certifies that the risk of re-identification is very small. Expert Determination allows you to retain more data in exchange for a formal risk analysis — it's more flexible, but it's also more expensive and harder to defend if something goes wrong.

For most ML teams without a dedicated privacy compliance function, Safe Harbor is the practical path. The question is how to apply it without gutting the dataset's usefulness.

The Date Binning Tradeoff

Dates are where the tension between privacy and utility is sharpest.

Clinical timelines matter enormously for machine learning. The sequence of events — admission date, symptom onset, lab results, medication changes, discharge — is often as predictive as the events themselves. Binning all dates to year-level destroys this temporal structure. A model trained on "2020 data" can't tell whether a patient deteriorated over two days or two weeks.

The practical middle ground is to preserve relative time while stripping absolute time. Rather than keeping exact dates, convert them to offsets from a reference point: "day 3 of hospitalization," "14 days after diagnosis." This preserves the temporal relationships that drive predictive signal while making it much harder to tie a record to a real calendar date, provided the reference date itself is discarded and not stored alongside the offsets.
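
As a rough sketch of the offset idea, the function below rewrites a patient timeline as day counts from an anchor event. The event names and the choice of admission as the anchor are assumptions for illustration.

```python
from datetime import date

def to_relative_days(events, anchor_key="admission"):
    """Replace absolute dates with day offsets from an anchor event."""
    anchor = events[anchor_key]
    return {name: (when - anchor).days for name, when in events.items()}

# Hypothetical timeline; the event names are illustrative assumptions.
timeline = {
    "admission": date(2020, 4, 3),
    "intubation": date(2020, 4, 6),
    "discharge": date(2020, 4, 17),
}

offsets = to_relative_days(timeline)
print(offsets)  # {'admission': 0, 'intubation': 3, 'discharge': 14}
```

Only `offsets` should leave the secure environment. If the anchor date travels with the record, the absolute dates can be recovered by simple addition.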

For COVID clinical notes specifically, preserving month and year (without day) often strikes a reasonable balance. The course of the pandemic mattered: outcomes in April 2020 looked different from outcomes in January 2021, when different variants were circulating and treatment protocols had evolved. Stripping month entirely removes context that's scientifically meaningful. Note, though, that Safe Harbor itself permits only the year to be retained. Keeping month and year means stepping outside Safe Harbor and relying on an Expert Determination that the added precision doesn't make individuals identifiable in the population.

I applied a similar approach in a crime data analysis on LAPD records: precise occurrence timestamps were converted to month/year, day of week, and hour of occurrence. This preserved enough temporal structure to track monthly trends and hourly crime patterns while preventing any single data point from being pinned to a specific date and time for a specific location — which, combined with the coordinates, could narrow down the incident to a handful of real households.
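
A minimal sketch of that kind of timestamp coarsening is below. It is not the code from the LAPD analysis; the output field names are assumptions.

```python
from datetime import datetime

DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")

def coarsen_timestamp(ts):
    """Keep month/year, weekday, and hour; drop day of month, minutes, seconds."""
    return {
        "year_month": f"{ts.year:04d}-{ts.month:02d}",
        "day_of_week": DAY_NAMES[ts.weekday()],  # weekday(): Monday == 0
        "hour": ts.hour,
    }

features = coarsen_timestamp(datetime(2021, 1, 15, 22, 47, 5))
print(features)  # {'year_month': '2021-01', 'day_of_week': 'Friday', 'hour': 22}
```

One caveat: month, weekday, and hour together still narrow an event to one of four or five possible dates in that month, so these features are coarser than a timestamp but not independent of it.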

ZIP Truncation

Geographic data is a persistent re-identification vector. A five-digit ZIP code, combined with full date of birth and sex, uniquely identifies a large fraction of the U.S. population. This has been demonstrated repeatedly in privacy research going back to Latanya Sweeney's work in the 1990s.

HIPAA's Safe Harbor standard allows three-digit ZIP codes (the first three digits of a five-digit ZIP), with an additional restriction: if the combined area of all ZIP codes sharing those three digits contains 20,000 or fewer people, the code must be replaced with 000. This prevents very small geographic areas from being used to narrow a population down to a size where re-identification becomes trivial.
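
The truncation rule can be sketched as a small function. Safe Harbor suppresses three-digit areas with 20,000 or fewer people; the population table here is made up for illustration, and a real pipeline would build it from Census data.

```python
def truncate_zip(zip_code, zip3_population, threshold=20_000):
    """Return the 3-digit ZIP prefix, or '000' for sparsely populated prefixes."""
    prefix = zip_code.strip()[:3]
    # Prefixes missing from the table are treated as population 0 and suppressed.
    population = zip3_population.get(prefix, 0)
    return prefix if population > threshold else "000"

# Made-up population counts, for illustration only.
ZIP3_POPULATION = {"900": 2_500_000, "999": 12_000}

print(truncate_zip("90012", ZIP3_POPULATION))  # 900
print(truncate_zip("99950", ZIP3_POPULATION))  # 000
```

Suppressing unknown prefixes by default is a design choice in this sketch: a lookup miss fails closed instead of leaking an unvetted code.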

The tradeoff is analytical resolution. Five-digit ZIP codes map to relatively small geographic areas, which makes them useful for neighborhood-level health equity analysis, identifying healthcare deserts, or studying how COVID outcomes varied from one neighborhood to the next. Three-digit ZIPs collapse those distinctions. Anything that requires neighborhood-level geography is significantly impaired.

One practical approach is to substitute a socioeconomic proxy — a region-level deprivation index, urbanicity classification, or census-tract-level median income — rather than the geographic identifier itself. This preserves the socioeconomic signal (which is often what the analysis actually needs) while removing the location specificity that drives re-identification risk.

Name and Identifier Redaction

Removing names, IDs, and contact information from structured fields is conceptually simple but operationally hard in clinical data. The problem is that clinical notes are free text.

Physicians write in natural language. Names appear in the body of notes — "patient John reports," "spouse Mary confirmed," "called Dr. Reyes." Hospital identifiers appear in the middle of sentences. Dates get written out as prose. Standard field-level redaction doesn't touch any of this; you need Named Entity Recognition (NER) models that can identify PHI in unstructured text and either redact or replace it.
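
To make the gap concrete, here is a sketch of a pattern-based scrubber of the kind that can run before an NER model. It is an illustration, not a substitute for NER: the patterns are simplified assumptions, and the example note is invented.

```python
import re

# Simplified patterns for illustration; real notes need broader coverage.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}

def scrub(text):
    """Replace each pattern match with a bracketed placeholder tag."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

note = "Pt John reports chest pain. Callback 555-867-5309, email jdoe@example.com."
scrubbed = scrub(note)
print(scrubbed)
# Pt John reports chest pain. Callback [PHONE], email [EMAIL].
```

The phone number and email are caught, but "John" passes straight through. Names have no fixed shape for a regular expression to match, which is why the free-text problem needs a learned model.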

The state of the art here is transformer-based NER trained on annotated clinical corpora — models like those from the i2b2 shared task, or fine-tuned clinical BERT variants. These can achieve high recall on structured PHI categories (names, dates, locations), but they're not perfect. Rare name forms, unusual abbreviations, and implicit identifiers ("the only COVID patient admitted to this ward that week") require human review at the margin.

The practical implication: NER-based de-identification of clinical notes requires a validation loop, not just a pipeline. You need to sample outputs, review failures, and iterate on the model before treating the de-identified corpus as safe for downstream use.
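
One way to put a number on that validation loop is to hand-annotate a sample of notes and measure how many PHI spans the model actually covered. The sketch below assumes spans are `(start, end)` character offsets; the annotation format and example values are hypothetical.

```python
def missed_spans(gold_spans, predicted_spans):
    """Return hand-annotated PHI spans not fully covered by any predicted span."""
    return [
        (g_start, g_end)
        for g_start, g_end in gold_spans
        if not any(p_start <= g_start and g_end <= p_end
                   for p_start, p_end in predicted_spans)
    ]

def phi_recall(gold_spans, predicted_spans):
    """Fraction of hand-annotated PHI spans that the model fully covered."""
    if not gold_spans:
        return 1.0
    missed = missed_spans(gold_spans, predicted_spans)
    return 1 - len(missed) / len(gold_spans)

# Hypothetical character offsets from one reviewed note.
gold = [(3, 7), (20, 32), (40, 45)]
predicted = [(3, 7), (18, 33)]

print(missed_spans(gold, predicted))  # [(40, 45)]
print(round(phi_recall(gold, predicted), 3))  # 0.667
```

Recall is the metric to watch here, because a missed identifier is a leak while an over-redaction only costs some utility. The list of missed spans is what feeds the next iteration on the model.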

The Mosaic Effect

This is the hardest part to get right, and the part most often underestimated.

Individual pieces of de-identified data may look innocuous in isolation. A patient record with year-level dates, three-digit ZIP, and no name seems safely anonymous. But identity is often reconstructable from combinations of quasi-identifiers that are, individually, benign.

Age, sex, three-digit ZIP, admission date (even year-level), and primary diagnosis together can uniquely identify a patient in a small population. The more unusual the combination — a rare diagnosis, a very elderly patient in a rural ZIP — the easier re-identification becomes. This is the mosaic effect: information that seems aggregated enough to be safe becomes individually identifying when the pieces are assembled.

The standard privacy-preserving response is k-anonymity: ensure that for every combination of quasi-identifiers in your dataset, at least k records share that combination. A record with k=1 is unique and potentially re-identifiable. k=5 means at least five records share those characteristics, so an adversary who knows the quasi-identifiers can narrow a target down to no better than one in five. Extensions like l-diversity and t-closeness add further constraints on the distribution of sensitive values within those groups.
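
Measuring k for a dataset amounts to grouping records by their quasi-identifier combination and finding the smallest group. A minimal sketch, with assumed field names and toy records:

```python
from collections import Counter

QUASI_IDENTIFIERS = ("age_band", "sex", "zip3", "diagnosis")  # assumed field names

def group_sizes(records, keys=QUASI_IDENTIFIERS):
    """Count how many records share each quasi-identifier combination."""
    return Counter(tuple(r[k] for k in keys) for r in records)

def dataset_k(records, keys=QUASI_IDENTIFIERS):
    """A dataset is k-anonymous for k equal to the size of its smallest group."""
    sizes = group_sizes(records, keys)
    return min(sizes.values()) if sizes else 0

# Toy records, invented for illustration.
common = {"age_band": "60-69", "sex": "F", "zip3": "123", "diagnosis": "pneumonia"}
rare = {"age_band": "90+", "sex": "M", "zip3": "000", "diagnosis": "rare_condition"}
records = [dict(common) for _ in range(5)] + [rare]

print(dataset_k(records))      # 1
print(dataset_k(records[:5]))  # 5
```

A single unusual record drags the whole dataset down to k=1, which is why the measurement has to be taken over every combination and not just the typical ones.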

In practice, enforcing k-anonymity often requires suppressing or further generalizing records in the "long tail" — unusual combinations that don't have enough peers. This creates a real tradeoff: the patients who are most unusual are often medically the most interesting, and they're also the ones most at risk of re-identification. De-identification tends to erase exactly the cases that matter most for rare disease research.
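
The cost of suppression is easy to see in a toy example. The sketch below drops every record whose quasi-identifier combination has fewer than k members, then reports which diagnoses vanished from the dataset as a result. Field names and records are assumptions for illustration.

```python
from collections import Counter

def suppress_small_groups(records, keys, k):
    """Drop records whose quasi-identifier combination has fewer than k members."""
    def combo(record):
        return tuple(record[field] for field in keys)

    sizes = Counter(combo(r) for r in records)
    return [r for r in records if sizes[combo(r)] >= k]

KEYS = ("age_band", "zip3", "diagnosis")  # assumed field names

# Toy cohort: five common cases plus one rare one.
cohort = [{"age_band": "40-49", "zip3": "123", "diagnosis": "asthma"}
          for _ in range(5)]
cohort.append({"age_band": "90+", "zip3": "000", "diagnosis": "rare_condition"})

kept = suppress_small_groups(cohort, KEYS, k=5)
lost = {r["diagnosis"] for r in cohort} - {r["diagnosis"] for r in kept}

print(len(kept))  # 5
print(lost)       # {'rare_condition'}
```

Tracking `lost` alongside the suppressed count makes the tradeoff visible at the time it is made, before a downstream analysis discovers the rare cases are gone.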

Lessons from Real Data Work

Working with the LAPD crime dataset — a public dataset with no HIPAA obligations — still required many of the same de-identification decisions you'd face in a clinical context. Coordinates were already rounded to the nearest hundred-block by the LAPD. I made additional choices: aggregating premise descriptions into broad categories ("Public" vs. "Residential") rather than preserving specific addresses, dropping precise timestamps after extracting coarser temporal features, and avoiding combinations of features that could narrow an incident to a specific identifiable individual.

These decisions directly shaped what the analysis could and couldn't show. Grouping locations by category prevented mapping crime to specific buildings — which was the right call ethically, but it also meant the spatial analysis could confirm that vehicle theft happens on streets (which anyone could observe), without being able to pinpoint which specific street segments or parking facilities were highest-risk.

That's the core tension in de-identification: every step that reduces privacy risk also reduces analytical resolution. There's no free lunch. The goal is to make that tradeoff explicitly and deliberately, rather than discovering it after the fact when a downstream analysis can't answer the question it was built to answer.

A Framework for Making These Decisions

When I'm working through de-identification choices on a new dataset, I find it useful to think through three questions in order:

  1. What's the minimum data needed to answer the research question? Start with data minimization. If the analysis doesn't require street-level geography, don't collect or retain it. The safest data is data you never had.
  2. What combinations of retained fields create re-identification risk? Think like an adversary. Given what's left in the dataset, what could someone infer if they had access to another data source — insurance records, voter rolls, a news article about a rare diagnosis? Run through realistic attack scenarios.
  3. What's the plan when the de-identification fails? NER models miss things. Suppression rules have edge cases. Data gets merged in unexpected ways. A breach response plan and clear data governance policy don't replace good de-identification, but they're necessary when de-identification isn't sufficient.
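
The second question can be tested directly by simulating a linkage attack: join the released records against a stand-in for an external source on the shared quasi-identifiers, and count how many released records match exactly one named person. Everything below is invented toy data with assumed field names.

```python
from collections import defaultdict

def linkage_matches(released, external, keys):
    """Map released-record index -> name, where exactly one external person matches."""
    index = defaultdict(list)
    for person in external:
        index[tuple(person[k] for k in keys)].append(person["name"])

    unique = {}
    for i, record in enumerate(released):
        candidates = index.get(tuple(record[k] for k in keys), [])
        if len(candidates) == 1:
            unique[i] = candidates[0]
    return unique

KEYS = ("age_band", "sex", "zip3")  # fields shared by both sources (assumed)

released = [
    {"age_band": "30-39", "sex": "F", "zip3": "123", "diagnosis": "asthma"},
    {"age_band": "90+", "sex": "M", "zip3": "124", "diagnosis": "rare_condition"},
]
# Stand-in for something like a voter roll; the names are placeholders.
external = [
    {"name": "Person A", "age_band": "30-39", "sex": "F", "zip3": "123"},
    {"name": "Person B", "age_band": "30-39", "sex": "F", "zip3": "123"},
    {"name": "Person C", "age_band": "90+", "sex": "M", "zip3": "124"},
]

reidentified = linkage_matches(released, external, KEYS)
print(reidentified)  # {1: 'Person C'}
```

The first record hides among two candidates, while the second links to a single person and exposes the diagnosis attached to it. Running this kind of check against whatever auxiliary data an adversary could plausibly obtain gives a concrete answer in place of a guess.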

None of this is a substitute for working with a privacy officer or legal team when the stakes are high. But framing the decisions explicitly — as deliberate tradeoffs rather than technical checkboxes — leads to better outcomes than treating de-identification as a one-time pipeline step.

The Bottom Line

Clinical data is too valuable to ignore and too sensitive to handle carelessly. De-identification is what makes it possible to use that data for research and machine learning without exposing real people to privacy risk — but only if the de-identification is done thoughtfully.

HIPAA's Safe Harbor gives you a legal floor, not a ceiling. Meeting the 18-identifier standard is necessary but not sufficient. The mosaic effect, identifiers buried in free text, and the long tail of rare cases all require active thought, not just rule-following.

The practical goal is a dataset that's useful enough to drive real research and private enough that you'd be comfortable defending every decision if a record were ever reconstructed. That's a higher bar than compliance, but it's the right bar.