
The growing use of electronic medical records, electronic insurance claims, and other medical software systems is generating massive volumes of personal health information. While the primary purpose of capturing this information is to provide patient care, researchers have a voracious appetite for this flood of data. But what may be good for medical and health care research may not be good for you if hackers can identify you in these datasets.

Millions of records of personal health information are being sought by federal and private sector initiatives to conduct advanced medical research and precision medicine, by regulatory entities seeking better reporting and transparency of the results of clinical trials and medical device testing, and by researchers trying to measure the quality and safety of health care. These data can also be used to identify epidemiological patterns, establish provider certification and accreditation, and drive marketing and other business applications.

Of course, raw data from electronic health records and the like can’t be used for these purposes. Sharing personal information like names, addresses, or Social Security numbers would violate the Health Insurance Portability and Accountability Act (HIPAA) and be downright dangerous for consumers.


That’s where de-identification comes in. This process removes personal information like names and Social Security numbers while preserving demographic and other data, such as diagnoses and geospatial information, that are vital for research.

When done poorly, de-identification can backfire. Take, for example, a well-publicized case using a de-identified insurance claims dataset from Massachusetts. Latanya Sweeney, then a Massachusetts Institute of Technology graduate student, compared this database against simple demographic information included in the Cambridge, Mass., voter registration list, which she had purchased for $20.


Several fields in the two databases were the same, including date of birth, 5-digit residential ZIP code, and gender. By matching these, Sweeney was able to extract health records for then-Governor William Weld, who had been briefly hospitalized after collapsing during a college graduation ceremony.
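The matching step described above can be sketched in a few lines of Python. This is only an illustration of the general linkage technique, not a reconstruction of Sweeney's actual method: the records, field names, and the `link` helper are all made up for the example.

```python
from collections import defaultdict

# Entirely made-up records for illustration; no real data.
claims = [
    {"dob": "1950-01-02", "zip": "02139", "sex": "M", "diagnosis": "A"},
    {"dob": "1950-01-02", "zip": "02139", "sex": "F", "diagnosis": "B"},
    {"dob": "1962-07-15", "zip": "02139", "sex": "M", "diagnosis": "C"},
    {"dob": "1962-07-15", "zip": "02139", "sex": "M", "diagnosis": "D"},
]
voters = [
    {"name": "Voter One", "dob": "1950-01-02", "zip": "02139", "sex": "M"},
    {"name": "Voter Two", "dob": "1962-07-15", "zip": "02139", "sex": "M"},
]

# Fields present in both datasets (hypothetical field names).
QUASI_IDENTIFIERS = ("dob", "zip", "sex")


def key(record):
    """Build the matching key from the shared fields."""
    return tuple(record[f] for f in QUASI_IDENTIFIERS)


def link(claims, voters):
    """Return {name: claim} for voters whose key matches exactly one claim."""
    by_key = defaultdict(list)
    for claim in claims:
        by_key[key(claim)].append(claim)
    matches = {}
    for voter in voters:
        candidates = by_key.get(key(voter), [])
        if len(candidates) == 1:  # an ambiguous match identifies no one
            matches[voter["name"]] = candidates[0]
    return matches


matches = link(claims, voters)
for name, claim in matches.items():
    print(name, "->", claim["diagnosis"])
```

In this toy data, the first voter is linked to a single claim, while the second shares a date of birth, ZIP code, and gender with two claims and so cannot be singled out. The more people who share a combination of values, the less useful that combination is to an attacker.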

Sweeney has since gone on to identify by name participants in the Personal Genome Project, a de-identified database of individuals who have volunteered to have their genomes sequenced. Other studies have also re-identified individuals from information that had been claimed to be de-identified.

These examples make news. But in reality, the success rate of these attacks is extremely low, and can often be chalked up to the fact that the data were not properly de-identified in the first place.

I have been studying and championing risk-based data de-identification for years. I believe that efforts to re-identify personal health information offer two key lessons. 1) There is never zero risk in sharing data, just as there is never zero risk in taking a walk down the street. 2) Data sharing is inherently a risk-management exercise. Once you understand how to manage the risks, it’s possible to ensure that only a very small level of risk remains.

Simply removing information such as names and addresses from a dataset doesn’t render the data anonymous and ensure that an individual can’t be identified. Conferring real privacy protection means carefully assessing the re-identification risk, setting acceptable risk thresholds, and transforming the data using de-identification standards.
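One simplified way to picture "assessing the risk and setting a threshold" is to group records that share the same indirect identifiers and look at the smallest group. The sketch below assumes that measure (worst-case risk equals one divided by the smallest group size). The field names, the sample records, and the threshold value are hypothetical, and real risk assessments under the standards discussed here consider much more than this.

```python
from collections import Counter


def max_reidentification_risk(records, quasi_identifiers):
    """Worst-case risk: 1 / size of the smallest group of records
    that share the same values for the given fields."""
    groups = Counter(tuple(r[f] for f in quasi_identifiers) for r in records)
    return 1 / min(groups.values())


# Made-up records with names and addresses already removed.
records = [
    {"birth_year": 1950, "zip3": "021", "sex": "M"},
    {"birth_year": 1950, "zip3": "021", "sex": "M"},
    {"birth_year": 1950, "zip3": "021", "sex": "M"},
    {"birth_year": 1962, "zip3": "021", "sex": "F"},
    {"birth_year": 1962, "zip3": "021", "sex": "F"},
]

THRESHOLD = 0.5  # hypothetical acceptable risk, chosen only for this example

risk = max_reidentification_risk(records, ("birth_year", "zip3", "sex"))
acceptable = risk <= THRESHOLD
print(f"max risk = {risk}, acceptable = {acceptable}")
```

Here the smallest group holds two records, so the worst-case risk is 0.5. If that exceeded the chosen threshold, the data would need further transformation before release.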

Several guidelines and standards exist for doing this. The Health Information Trust Alliance (HITRUST), for example, recently released a de-identification framework that organizations can use when creating, accessing, storing, or exchanging personal information. Other organizations, like the Institute of Medicine and the Council of Canadian Academies, have adopted similar standards that permit sharing sensitive data while managing the risks of re-identification. Current evidence shows that the risk of re-identification using these approaches is very small.

It’s important to keep in mind that a trade-off exists between privacy and utility. Creating a dataset with a low probability of re-identifying an individual means that it will likely be less useful for analysis.
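A small, made-up example shows the trade-off. Coarsening dates of birth to years and ZIP codes to their first three digits makes records harder to single out, but it also throws away detail an analyst might want. The `generalize` rule and the data below are assumptions for illustration, not the rule prescribed by any particular standard.

```python
from collections import Counter


def max_risk(records, fields):
    """Worst-case risk: 1 / size of the smallest group sharing the same values."""
    groups = Counter(tuple(r[f] for f in fields) for r in records)
    return 1 / min(groups.values())


def generalize(record):
    """Coarsen date of birth to year and ZIP code to its first three digits."""
    return {"dob": record["dob"][:4], "zip": record["zip"][:3], "sex": record["sex"]}


# Made-up records for illustration.
raw = [
    {"dob": "1950-01-02", "zip": "02139", "sex": "M"},
    {"dob": "1950-06-30", "zip": "02138", "sex": "M"},
    {"dob": "1962-07-15", "zip": "02139", "sex": "F"},
    {"dob": "1962-11-03", "zip": "02141", "sex": "F"},
]
coarse = [generalize(r) for r in raw]
fields = ("dob", "zip", "sex")

risk_raw = max_risk(raw, fields)        # every record is unique
risk_coarse = max_risk(coarse, fields)  # records now fall into groups of two

distinct_dobs_raw = len({r["dob"] for r in raw})
distinct_dobs_coarse = len({r["dob"] for r in coarse})

print(risk_raw, risk_coarse, distinct_dobs_raw, distinct_dobs_coarse)
```

The worst-case risk falls from 1.0 to 0.5, but the four distinct birth dates collapse into two birth years. An analysis that depends on age in months or on neighborhood-level geography would no longer be possible with the coarsened data.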

While the goal of de-identification is to reduce to zero the chance that an individual in a dataset can be re-identified, that is impossible to guarantee.

Sharing health data offers great opportunity for innovation and improved patient outcomes. But these benefits need not come at the cost of diminishing privacy. By holding organizations to high privacy standards when sharing health data, we can move health care forward while still ensuring patient anonymity.

Sam Wehbe is a director at Privacy Analytics, a health care data de-identification technology company based in Ottawa, Ontario, where he is a champion for risk-based de-identification methodology and the key role it plays in preventing data breaches.

  • Catherine –
    Thanks for your comment on the article. You are completely right – the dataset Sweeney re-identified wasn’t de-identified as per HIPAA’s guidelines. In doing so, she brought to light a serious issue that the industry is still working to address as data sharing becomes more prevalent.

    HIPAA allows for two approaches to de-identification: Safe Harbor and Expert Determination. Research shows that a risk-based approach using the Expert Determination methodology ensures the least amount of risk of re-identification while preserving the greatest utility of the data for its intended use. Most globally accepted standards and guidelines are based on this approach, including those from the Institute of Medicine (IOM), Health Information Trust Alliance (HITRUST), PhUSE, the Council of Canadian Academies, as well as HIPAA and the EU General Data Protection Directive.

    As the demands for access to patient data for research and analysis continue to grow, a scalable, risk-based approach is needed to ensure patient privacy, achieve compliance and instill trust. Cheers,

  • It’s also worth mentioning that, for those looking for details on the de-identification regulations, HHS’s November 2012 HIPAA guidance, “Guidance regarding methods for De-identification of Protected Health Information,” is easily accessible from the HHS website.

  • Catherine is indeed correct that all 5-digit Zip Codes and full Dates of Birth (or any dates more specific than the year) must be removed under the HIPAA Safe Harbor De-identification standards. However, the re-identification of Governor William Weld’s health data was conducted prior to the implementation of the HIPAA Privacy Rule on April 14, 2003. My 2012 Social Science Research Network paper “The ‘Re-Identification’ of Governor William Weld’s Medical Information: A Critical Re-Examination of Health Data Identification Risks and Privacy Protections, Then and Now” (Abstract 2076397) provides a detailed examination of this famous re-identification attack and of the de-identification protections now in effect through the HIPAA Privacy Rule.

  • The Personal Genome Project famously does NOT de-identify its data, and explicitly notes the potential threat of re-identification. Participants must pass an entrance exam in which they demonstrate understanding of the risk of potential identifiability of their data. Some participants actually put their names on their profile/data pages.

  • Just a small point of order. The phrasing in the article makes it sound like the data the grad student “reidentified” was initially de-identified through the method of removing name, address and SSN. According to the US HIPAA law, 5-digit ZIP code and date of birth are also identifiers (the law specifies 18 kinds of information that are considered identifiers). This dataset never was de-identified in accordance with HIPAA and should not pass muster as such at an IRB.
    This is a great article about the risk of data collection. I’d love to share this with my researchers as an example of why we at the IRB are so vigilant when it comes to the information that is being collected and how it will be stored or transferred.
    Thank you,
