Dateline: May 18, 1996 – The collapse and attack. Massachusetts Governor William Weld wasn’t feeling well under his commencement cap and gown. He was about to receive an honorary doctorate from Bentley College and give their keynote graduation address. But, unbeknownst to him, he would instead make a critical contribution to the privacy of our health information. As he stepped forward to the podium, it wasn’t what Weld said that now protects your health privacy, but rather what he did: He teetered and collapsed unconscious before a shocked audience.

Weld recovered quickly, and the incident might have passed quietly but for an MIT graduate student. Latanya Sweeney’s studies had drawn her attention to hospital data released to researchers by the Massachusetts Group Insurance Commission (GIC) for the purpose of improving healthcare and controlling costs. Federal Trade Commission Senior Privacy Adviser Paul Ohm provides a gripping account of Sweeney’s now famous re-identification of Weld’s hospitalization data using voter list information in his 2010 paper “Broken Promises of Privacy.”

It would be difficult to overstate the influence of the Weld voter list attack on health privacy policy in the United States – it had a direct impact on the development of the de-identification provisions in the HIPAA Privacy Rule. However, careful examination of the demographics of Cambridge, MA at the time of the re-identification attempt indicates that Weld was re-identifiable most likely only because he was a public figure who experienced a highly publicized hospitalization – not because the Cambridge voter data could establish the match with any real certainty.

The Cambridge population was nearly 100,000, but the voter list contained only 54,000 of those residents, so the voter linkage alone could not provide sufficient evidence for any definitive re-identification. The logic of re-identification depends critically on demonstrating that a person within a health data set is the only person in the larger population with a given combination of “quasi-identifier” characteristics (such as birth date, sex, and ZIP code); without a complete and accurate population register, that demonstration cannot be made. Furthermore, the same methodological flaws that undermined the certainty of the Weld re-identification continue to create far-reaching systemic challenges for all re-identification attempts – a fact that must be understood by public policy-makers seeking to realistically assess the privacy risks posed by HIPAA de-identified data. (The full technical details of these re-identification risk assessment issues are available in a lengthier review.)
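The incomplete-register problem can be made concrete with a toy calculation (my own illustrative model, not the analysis from the review cited above). Suppose k residents of the full population share a target’s quasi-identifiers, and each appears in the voter register independently with probability equal to the register’s coverage. The target can then look unique in the register even when off-register “twins” exist:

```python
# Toy model (illustrative assumption, not from the cited review): each of
# the k population members sharing a quasi-identifier profile appears in
# the voter register independently with probability c (the coverage).

def p_apparent_uniqueness(k: int, c: float) -> float:
    """P(the target looks unique in the register even though k-1 other
    residents share the same quasi-identifiers but are unregistered)."""
    return (1 - c) ** (k - 1)

coverage = 54_000 / 100_000   # Cambridge voter-list coverage cited above
for k in (1, 2, 3):
    print(k, round(p_apparent_uniqueness(k, coverage), 3))
```

With Cambridge’s 54 percent coverage, a single unregistered “twin” would go undetected 46 percent of the time – so apparent uniqueness in the register falls well short of proof that a match is correct.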

With the benefit of hindsight, it is apparent that the Weld/Cambridge re-identification has served as an important illustration of privacy risks that were not adequately controlled prior to the 2003 HIPAA Privacy Rule. Still, a broader policy debate continues to rage between some voices, like Ohm, alleging that computer scientists can re-identify individuals hidden in anonymized data with “astonishing ease,” and others who view de-identified data as an essential foundation for a host of envisioned advances under healthcare reform.

Nowhere is this tension more evident within the health policy arena than in the recent proposal by the Office of the National Coordinator for Health Information Technology (ONC) for standards, services, and policies enabling secure health information exchange over the Internet to support the Nationwide Health Information Network (NwHIN). Motivated by concern that perceived re-identification risks could “undermine trust”, ONC proposes that de-identified health information could not be used or disclosed for any commercial purpose – a policy that would be certain to open a Pandora’s box of unintended consequences. Yet ONC also signals its own skepticism regarding purported re-identification risks by noting that they have been “somewhat exaggerated”.

Because a vast array of healthcare improvements and medical research critically depends on de-identified health information, the essential public policy challenge, then, is to accurately assess the current state of privacy protections for de-identified data and to properly balance both risks and benefits to maximum effect.

Re-Identification Risks Today Under the HIPAA Privacy Rule

HHS appropriately responded to the concerns raised by the Weld/Cambridge voter list privacy attack and, through the HIPAA Privacy Rule, acted to help prevent re-identification attempts.

In 2007, testifying before the Ad Hoc Workgroup on Secondary Uses of Health Data of the National Committee on Vital and Health Statistics, Dr. Latanya Sweeney reported that 0.04 percent (4 in 10,000) of the individuals in the U.S. population within data sets de-identified using the “Safe Harbor” method could be identified on the basis of their year of birth, gender, and three-digit ZIP code. To provide some perspective, this risk is roughly four times the oft-cited lifetime odds of being struck by lightning (about one in 10,000).

Further boosting our confidence that re-identification is not a trivial task under today’s protections, a 2010 study estimated re-identification risks under the HIPAA Safe Harbor rule on a state-by-state basis using voter registration data. The percentage of a state’s population estimated to be vulnerable (i.e., not definitively re-identified, but potentially re-identifiable) ranged from 0.01 percent to 0.25 percent.

Another likely source of ONC’s skepticism about re-identification risks is ONC’s own 2011 study, which examined an attack on HIPAA de-identified data under realistic conditions, testing whether HIPAA Safe Harbor de-identified data could be combined with external data to re-identify patients. Crucially, the study verified its candidate re-identifications against direct identifiers – a step often missing from work of this sort. The team began with a set of about 15,000 de-identified patient records. The experiment produced a confirmed match for only two of the 15,000 individuals (a re-identification rate of 0.013 percent), and even under maximally strong assumptions about the possible knowledge of the hypothetical intruder, the re-identification risk (under the questionable assumption that re-identification would even be attempted) was likely to be less than 0.22 percent.

Re-identification risks under the HIPAA Privacy Rule have been reduced to the point that most people wouldn’t (and shouldn’t) lose any sleep over the issue.

What’s At Stake For The Future Of Health Care?

Balancing privacy protection and scientific accuracy. Considerable costs come with incorrectly evaluating the true risks of re-identification under current HIPAA protections. It is essential to understand that de-identification comes at a cost to the scientific accuracy and quality of the healthcare decisions that will be made based on research using de-identified data. Balancing disclosure risks and statistical accuracy is crucial because some popular de-identification methods, such as “k-anonymity methods,” can unnecessarily, and often undetectably, degrade the accuracy of de-identified data for multivariate statistical analyses. This problem is well understood by statisticians and computer scientists, but not well-appreciated in the public policy arena. Poorly conducted de-identification and the overuse of de-identification methods in cases where they do not produce real privacy protections can quickly lead to “bad science” and damaging policy decisions.
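The accuracy trade-off can be illustrated with a minimal sketch of generalization-based k-anonymity on invented records (a toy example, not any specific production method): achieving even 2-anonymity on exact birth years can force the years into coarse bands, degrading any downstream analysis that needs age precision.

```python
from collections import Counter

# Invented toy records: (birth_year, sex, zip3)
records = [
    (1948, "M", "021"), (1948, "M", "021"),
    (1951, "F", "021"), (1953, "F", "021"),
    (1960, "M", "024"), (1962, "M", "024"),
]

def min_group_size(rows):
    """Size of the smallest quasi-identifier equivalence class."""
    return min(Counter(rows).values())

def generalize_year(rows, band):
    """Coarsen birth year into band-year bins (precision is lost)."""
    return [(y - y % band, s, z) for (y, s, z) in rows]

# Widen the year bands until every record has at least one twin (k = 2).
band, rows = 1, records
while min_group_size(rows) < 2:
    band += 4
    rows = generalize_year(records, band)

print(band)                  # width of year band finally required
print(min_group_size(rows))  # every class now has at least 2 records
```

Even in this tiny example, exact birth years must be replaced by 5-year bands before every record has a twin – exactly the kind of silent information loss that can bias multivariate analyses downstream.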

Even worse, if we abandon the use of de-identified data because we falsely believe that de-identification cannot provide valuable privacy protections, we will lose the rich benefits that come from analysis of de-identified health data. Jane Yakowitz, a University of Arizona Law School professor, wrote extensively on this topic in her paper “Tragedy of the Data Commons,” which addresses the societal costs – lost information flow and slowed knowledge growth – that would follow from abandoning a realistic assessment of re-identification risks.

The reality is that, while one can point to very few, if any, cases of persons who have been harmed by attacks with verified re-identifications, virtually every member of our society has routinely benefited from the use of de-identified health information. De-identified health data is the workhorse that supports numerous healthcare improvements and a wide variety of medical research activities. But just as we cannot identify the specific people who have had their lives saved by speed limit laws, we may fail to realize that we owe our lives to the ongoing research and health system improvements achieved with de-identified data. Hopefully, advancements will continue to accrue in generations to come, but unfounded fears of re-identification could derail this progress.

In my own career as an HIV epidemiologist, I have heightened concerns not only for the very important personal privacy of individuals, but also for the serious tragedies that would occur if fears about de-identification led to a failure to detect and control the next emerging infectious disease that begins to spread globally. If we abandon the use of de-identified data simply because of unwarranted fears regarding privacy risks under today’s HIPAA protections, the consequences of such misguided public policy could be truly disastrous. Privacy advocates and policymakers alike must better understand that, rather than posing new privacy risks, using de-identified data under HIPAA results in vast (thousands-fold) improvements in our individual privacy protection and also sustains a rich public good in research and healthcare improvements.

This critical role that de-identified health information plays in improving healthcare is becoming increasingly widely recognized, but properly balancing the competing goals of protecting patient privacy while also preserving the accuracy of research requires policy makers to realistically assess both sides of this coin. De-identification policy must achieve an ethical equipoise between potential privacy harms and the very real benefits that result from the advancement of science and healthcare improvements accomplished with de-identified data. Properly implemented de-identification complying with the HIPAA de-identification provisions goes a long way toward promoting such a reasonable balance, but I would suggest that there is still room for further improvement in this regard.

Where should we go from here? Because re-identification attacks could still put rare but very real people—with names, faces, and personal lives—at risk of potential privacy harms, we should actively prohibit re-identification, and require those with access to de-identified data to guard and use it appropriately.

HHS Office for Civil Rights (OCR) regulators have promised to provide new guidance in the near future for the de-identification of health data in response to a Congressional mandate to do so. HHS OCR regulators should consider whether it is appropriate for de-identified data to fall entirely outside the purview of the Privacy Rule, or whether, like the so-called “Limited Data Sets” (LDSs), which have been stripped of 16 types of direct identifiers, de-identified data should be subject to certain terms in required Data Use Agreements (DUAs) or subject to direct HHS mandates for use conditions. Effective parallels to the LDS DUA can be carefully constructed to provide assurances that help to further limit re-identification concerns, but that also impose little unnecessary burden on appropriate uses of de-identified data.

Several recommended best practices for the use of de-identified data that should be considered by regulators as possible mandatory de-identified data use conditions include:

  1. Prohibiting the re-identification, or attempted re-identification, of individuals and their relatives, family or household members. We should establish civil and criminal penalties for unauthorized re-identification of de-identified data (and for limited data sets). A carefully designed prohibition on re-identification attempts could still allow re-identification research approved by Institutional Review Boards (IRBs) to be conducted, but would ban re-identification attempts conducted without essential human subjects research protections.
  2. Requiring parties who wish to link new data elements (which might increase re-identification risks) with data de-identified under the Statistical De-identification provision of the Privacy Rule to confirm that the data remains de-identified.
  3. Specifying that HIPAA de-identification status would expire if, at any time, the data set contains elements specified within an evolving Safe Harbor list. The Safe Harbor list should be periodically updated by HHS to include any new “quasi-identifiers” for which population registries of sufficient completeness and accuracy might be reasonably constructed.
  4. Formally specifying that for statistically de-identified data, anticipated data recipients must always comply with specified time limits, data use restrictions, qualifications or conditions set forth in the statistical de-identification determination associated with the data.
  5. Requiring those holding and using de-identified data to implement and maintain appropriate data security and privacy policies, procedures and associated physical, technical and administrative safeguards as needed to assure that this data is: (a) accessed only by personnel or parties who have agreed to abide by the foregoing conditions, and (b) will remain de-identified in accordance with HIPAA de-identification provisions.
  6. Requiring those transferring de-identified data to third parties to enter into data use agreements that would oblige those receiving the data to also hold to the conditions listed here, thus maintaining an important “chain-of-trust” data stewardship principle accompanying de-identified data throughout its uses.
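The “evolving Safe Harbor list” in item 3 could be operationalized as a simple screening step. The sketch below is hypothetical: the element names are invented for illustration and are not the official HHS Safe Harbor identifier list.

```python
# Hypothetical, abbreviated quasi-identifier list; an HHS-maintained
# version would be periodically updated, as recommendation 3 envisions.
PROHIBITED_ELEMENTS = {
    "name", "street_address", "phone", "email", "ssn",
    "medical_record_number", "full_date_of_birth", "zip5",
}

def screen_columns(columns, prohibited=PROHIBITED_ELEMENTS):
    """Return any columns that would void a data set's de-identified
    status under the (hypothetical) current Safe Harbor list."""
    return sorted({c.lower() for c in columns} & prohibited)

# A data set with an SSN column would be flagged; Safe Harbor-style
# fields such as birth year and 3-digit ZIP would pass.
flagged = screen_columns(["birth_year", "sex", "zip3", "SSN"])
```

A data holder could run such a check whenever new elements are linked into a de-identified data set, making expiration of de-identified status detectable rather than accidental.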

Data use requirements of the sort suggested above would impose only modest burdens on the use of de-identified data and would help provide recourse against data intruders and against parties who have failed to properly manage the very small re-identification risks that might still be associated with de-identified data.


William Weld’s 1997 “re-identification” had an important impact on improving healthcare privacy because it led to regulations that meaningfully protect patients from re-identification risks. But the Weld saga does not reflect the privacy risks that exist under the HIPAA Privacy Rule today. We should not let today’s de minimis re-identification risks cause us to abandon our use of de-identified data to protect privacy, save lives, and continue to improve our healthcare system.

Hopefully, HHS regulators issuing the impending de-identification guidance and considering the role of de-identified data for the NwHIN will recognize that substantive de-identification protections have already been achieved, and will carefully weigh the substantial societal benefits that flow from our ability to conduct analyses, innovate, and improve our healthcare systems using de-identified health data.