The growing use of electronic medical records, electronic insurance claims, and other medical software systems is generating massive volumes of personal health information. While the primary purpose of capturing this information is to provide patient care, researchers have a voracious appetite for this flood of data. But what may be good for medical and health care research may not be good for you if hackers can identify you in these datasets.
Millions of records of personal health information are being sought by federal and private sector initiatives to conduct advanced medical research and precision medicine, by regulatory entities seeking better reporting and transparency of the results of clinical trials and medical device testing, and by researchers trying to measure the quality and safety of health care. These data can also be used to identify epidemiological patterns, establish provider certification and accreditation, and drive marketing and other business applications.
Of course, raw data from electronic health records and the like can’t be used for these purposes. Sharing personal information like names, addresses, or Social Security numbers would violate the Health Insurance Portability and Accountability Act (HIPAA) and be downright dangerous for consumers.
That’s where de-identification comes in. This process removes personal information like names and Social Security numbers while preserving the demographic, diagnostic, and geospatial data that are vital for research.
When done poorly, de-identification can backfire. Take, for example, a well-publicized case using a de-identified insurance claims dataset from Massachusetts. Latanya Sweeney, then a Massachusetts Institute of Technology graduate student, compared this database against simple demographic information included in the Cambridge, Mass., voter registration list, which she had purchased for $20.
Several fields in the two databases were the same, including date of birth, 5-digit residential ZIP code, and gender. By matching these, Sweeney was able to extract health records for then-Governor William Weld, who had been briefly hospitalized after collapsing during a college graduation ceremony.
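The matching step described above can be sketched in a few lines of Python. This is only an illustration of the general linkage idea, not a reconstruction of Sweeney's actual method: the records, names, and field names (`dob`, `zip`, `sex`) are invented for the example, and a voter counts as linked only when exactly one claims record shares all three fields.

```python
from collections import defaultdict

# Made-up records for illustration only; no real data.
claims = [
    {"dob": "1950-01-15", "zip": "02139", "sex": "M", "diagnosis": "A"},
    {"dob": "1962-03-02", "zip": "02139", "sex": "F", "diagnosis": "B"},
    {"dob": "1962-03-02", "zip": "02139", "sex": "F", "diagnosis": "C"},
]
voters = [
    {"name": "Voter One", "dob": "1950-01-15", "zip": "02139", "sex": "M"},
    {"name": "Voter Two", "dob": "1962-03-02", "zip": "02139", "sex": "F"},
    {"name": "Voter Three", "dob": "1971-11-30", "zip": "02138", "sex": "F"},
]

# Fields present in both datasets (assumed names).
QUASI_IDENTIFIERS = ("dob", "zip", "sex")


def key(record):
    """Return the tuple of shared fields used to compare records."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link(claims, voters):
    """Map a voter's name to a diagnosis when the match is one-to-one."""
    claims_by_key = defaultdict(list)
    for claim in claims:
        claims_by_key[key(claim)].append(claim)
    voters_by_key = defaultdict(list)
    for voter in voters:
        voters_by_key[key(voter)].append(voter)

    matches = {}
    for k, same_voters in voters_by_key.items():
        same_claims = claims_by_key.get(k, [])
        # Only an unambiguous pairing counts as a re-identification.
        if len(same_voters) == 1 and len(same_claims) == 1:
            matches[same_voters[0]["name"]] = same_claims[0]["diagnosis"]
    return matches


matches = link(claims, voters)
```

In this toy data only "Voter One" is linked. "Voter Two" shares her three fields with two claims records, so the match is ambiguous, which is the kind of ambiguity that proper de-identification aims to create.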
Sweeney has since gone on to identify by name participants in the Personal Genome Project, a de-identified database of individuals who have volunteered to have their genomes sequenced. Other studies have also re-identified individuals from information that had been claimed to be de-identified.
These examples make news. But in reality, the success rate of these attacks is extremely low, and the successes can often be chalked up to the fact that the data were not properly de-identified in the first place.
I have been studying and championing risk-based data de-identification for years. I believe that efforts to re-identify personal health information offer two key lessons. First, there is never zero risk in sharing data, just as there is never zero risk in taking a walk down the street. Second, data sharing is inherently a risk-management exercise. Once you understand how to manage the risks, it’s possible to ensure that only a very small level of risk remains.
Simply removing information such as names and addresses from a dataset doesn’t render the data anonymous or ensure that an individual can’t be identified. Conferring real privacy protection means carefully assessing the re-identification risk, setting acceptable risk thresholds, and transforming the data using de-identification standards.
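One common way to put a number on re-identification risk is to group records that share the same indirect identifiers and treat each record's risk as one divided by the size of its group. The sketch below assumes that measure; the sample records, the field names, and the threshold of 0.5 are hypothetical choices made for illustration and do not come from any particular standard.

```python
from collections import Counter


def reidentification_risk(records, quasi_identifiers):
    """Return (maximum, average) per-record risk, where a record's risk
    is 1 divided by the number of records sharing its quasi-identifiers."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    group_sizes = Counter(keys)
    per_record = [1 / group_sizes[k] for k in keys]
    return max(per_record), sum(per_record) / len(per_record)


# Made-up records with names and addresses already removed.
records = [
    {"age": 34, "zip": "02139", "sex": "F"},
    {"age": 34, "zip": "02139", "sex": "F"},
    {"age": 35, "zip": "02139", "sex": "F"},
    {"age": 61, "zip": "02138", "sex": "M"},
]

THRESHOLD = 0.5  # illustrative acceptable maximum risk, an assumption

max_risk, avg_risk = reidentification_risk(records, ("age", "zip", "sex"))
acceptable = max_risk <= THRESHOLD
```

Here two of the four records are unique on age, ZIP code, and sex, so the maximum risk is 1.0 and the dataset fails the threshold even though it contains no names. It would need further transformation before release.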
Several such guidelines and standards exist for doing this. The Health Information Trust Alliance (HITRUST), for example, recently released a de-identification framework that organizations can use when creating, accessing, storing, or exchanging personal information. Other organizations, like the Institute of Medicine and the Council of Canadian Academies, have adopted similar standards that permit sharing sensitive data while managing the risks of re-identification. Current evidence shows that the risk of re-identification using these approaches is very small.
It’s important to keep in mind that a trade-off exists between privacy and utility. Creating a dataset with a low probability of re-identifying an individual means that it will likely be less useful for analysis.
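The trade-off can be seen in a small sketch. It assumes one simple transformation, generalization, in which exact ages become ten-year bands and five-digit ZIP codes are cut to their first three digits. The data and the `generalize` helper are hypothetical, and real de-identification tools use a wider range of techniques.

```python
from collections import Counter


def smallest_group(records):
    """Size of the smallest set of records that look identical."""
    return min(Counter(records).values())


def generalize(record):
    """Coarsen an (age, ZIP) pair: ten-year age band, three-digit ZIP."""
    age, zip_code = record
    decade = age // 10 * 10
    return (f"{decade}-{decade + 9}", zip_code[:3] + "**")


# Made-up (age, ZIP code) pairs for illustration.
raw = [(34, "02139"), (36, "02138"), (38, "02139"), (61, "02140"), (67, "02141")]
coarse = [generalize(r) for r in raw]

raw_smallest = smallest_group(raw)        # every record is unique
coarse_smallest = smallest_group(coarse)  # each record now has company
distinct_after = len(set(coarse))         # how much detail survives
```

Before generalization every record is unique. Afterward each record shares its values with at least one other, which lowers the risk, but the five distinct records have collapsed into two coarse groups, so any analysis that depends on exact age or neighborhood is no longer possible.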
While the goal of de-identification is to reduce to zero the chance that an individual in a dataset can be re-identified, that is impossible to guarantee.
Sharing health data offers great opportunity for innovation and improved patient outcomes. But these benefits need not come at the cost of diminishing privacy. By holding organizations to high privacy standards when sharing health data, we can move health care forward while still ensuring patient anonymity.
Sam Wehbe is a director at Privacy Analytics, a health care data de-identification technology company based in Ottawa, Ontario, where he is a champion for risk-based de-identification methodology and the key role it plays in preventing data breaches.