The Privacy Conundrum And Genomic Research: Re-Identification And Other Concerns



September 11th, 2013
by Joel Kupersmith

No matter what the arena — finance, health care, or national security — questions surrounding the provision of personal data are always the same: how much benefit vs. how much risk? Who handles these data, and can those individuals be trusted? How do organizations guard against data misuse? What are the legal safeguards to protect privacy, and are they sufficient in an era when more data are shared more widely?

Nowhere is the privacy discussion more personal than in genomics, the very hardwiring of our existence. Genomic data are unique to each individual (identical twins excepted) and, except for occasional mutations, do not change over a lifetime, thereby rendering disclosures permanent. Genomic data also have special properties regarding privacy, especially as comprehensive whole genome sequencing becomes the major technique.

The benefits of amassing genomic data in case numbers sufficient for validity and making this knowledge available to an appropriately wide body of expert investigators are extensive. Research derived from genomic databases offers potentially large health payoffs. Genomics can help scientists predict who will develop a disease (e.g., Huntington’s disease) and tailor treatments. It also holds the potential to bring about a paradigm shift in how we think about and classify disease; i.e., allowing us to move from the pathology-based approach begun in the late 19th century, which focuses on the progression of disease in a specific organ, to a biochemical- and genomics-based approach. This new approach is already being applied to a number of diseases, including certain cancers.

Yet the damage caused by the misuse of genomic data can be irreparable. Disclosure of genomic information can have stigmatizing consequences for both the individual and the family, particularly first-degree relatives. These consequences include employment discrimination, denial of life insurance, and inappropriate marketing. And that is the conundrum: realizing the promise of genomics while safeguarding genomic data and protecting individual privacy.

Over the past 40 years, federal and state policymakers have addressed various pieces of the privacy puzzle in laws and rules, many of which relate to genomic research. Among the more notable are the 1974 Privacy Act, which protects systems of federal government records; the Certificates of Confidentiality statutes (established for AHRQ and NIH investigators), which restrict data use outside of research and protect research subjects from forced disclosure in various proceedings; the Common Rule, which in 1991 mandated that Institutional Review Boards oversee research privacy and security; HIPAA, which was passed five years later and protects the privacy of clinical information via the “safe harbor” of 18 specific identifiers (research databases are given leeway to not include some identifiers); and, more recently, the Genetic Information Nondiscrimination Act of 2008, which protects individuals from genetic discrimination in employment and health insurance coverage.

Additional activity at the federal level includes the Presidential Commission for the Study of Bioethical Issues (“Presidential Commission”) and its recent report, “Privacy and Progress in Whole Genome Sequencing,” among other initiatives. There is also a patchwork of privacy policies and laws in about half the states.

Until now, anonymization of data, database security, and these protective laws have been the primary safeguards against security lapses in genomic data made partially or largely available to the public. However, various factors, including the formation of very large databases and the sharing of data with, and access by, large numbers of individuals, have put new strains on genomic security.

The Problem Of Re-Identification

In addition — and most importantly — the recent availability of large amounts of data in various public and non-public registries, coupled with advanced approaches to data matching, has resulted in a new form of data hacking. Under the right circumstances, a persistent individual can overcome barriers to identification and “re-identify” an individual in an anonymized database. Doing so generally requires access to at least two databases: a “candidate” database, which holds the anonymized data, and one or more “reference” databases, which share elements (e.g., zip codes) in common.

Re-identification of an individual or small group is essentially achieved by using “quasi-identifiers” to cross-reference specific elements that are included along with the genetic data but are also found in other databases; i.e., elements that can directly or indirectly narrow down a given group or subgroup until the subject can be recognized.

For example, zip codes include many individuals, but they do not directly identify subjects. However, when zip codes are cross-matched with information from other databases, such as driver and vehicle records, the number of individuals is whittled down. Further cross-matching with, for example, voter registration files can whittle the number down even further, until a particular individual is identified. While this re-identification problem itself is not new, the prospect of re-identifying genomic data has elevated concerns about privacy.
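
To make the cross-matching mechanics concrete, here is a minimal sketch of a linkage attack on entirely invented data, assuming the pandas library: a de-identified “candidate” table is joined to a hypothetical voter-registration “reference” table on the quasi-identifiers they share (zip code, birth year, sex). Every record, column name, and value is made up for illustration and is not drawn from any real database.

```python
# Illustrative linkage ("re-identification") attack on toy data.
# All records and column names are hypothetical.
import pandas as pd

# "Candidate" database: de-identified research records (no names).
candidate = pd.DataFrame({
    "record_id": [101, 102, 103],
    "zip": ["20007", "20007", "22204"],
    "birth_year": [1954, 1987, 1954],
    "sex": ["F", "M", "F"],
    "finding": ["variant A", "variant B", "none"],
})

# "Reference" database: a public roster with names attached.
reference = pd.DataFrame({
    "name": ["A. Smith", "B. Jones", "C. Doe"],
    "zip": ["20007", "20007", "22204"],
    "birth_year": [1954, 1987, 1960],
    "sex": ["F", "M", "F"],
})

# Cross-match on the quasi-identifiers present in both tables.
quasi_identifiers = ["zip", "birth_year", "sex"]
matches = candidate.merge(reference, on=quasi_identifiers, how="inner")

# A research record that matches exactly one named person is effectively re-identified.
name_counts = matches.groupby("record_id")["name"].nunique()
re_identified = matches[matches["record_id"].isin(name_counts[name_counts == 1].index)]
print(re_identified[["record_id", "name", "finding"]])
```

Real attacks involve far larger tables and noisier join keys, but the principle is the same: each additional quasi-identifier shrinks the set of plausible matches.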

Such concern is justifiable, particularly given the variety of available reference databases: hospital quality data; ICD-9 codes (powerful identifiers); death master files (from which social security numbers can be deduced); social security databases; vehicular databases; voter registration lists; house sales; U.S. census data; public records search engines (e.g., PeopleFind.com); and many others. With regard to genomics databases specifically, private genetic and genealogy databases and even listings of relatives in obituaries can be used as reference databases for matching. The more specific the fields in a reference database, the better they serve as quasi-identifiers.

Because genes have specifically identifiable common elements (haplotypes, mutations) and are also shared with blood relatives, they offer distinct possibilities for re-identification. For example, a recent study showed how individuals could be re-identified in what was previously considered an “anonymous” database, the 1000 Genomes Project. In addition to genomic data, this database included year of birth and state of residence (neither of which is protected by HIPAA from disclosure).

To re-identify, researchers first extracted short tandem repeats on the Y (male) chromosome (Y-STR markers) from the genomes of 32 individuals of European ancestry. These particular markers distinguish male lines in families. Next, they plugged the Y-STR information into two publicly available recreational genealogy databases, giving them access to about 40,000 surnames and pedigrees. They found eight matches to Mormon families in Utah. Then, using information from the Coriell Cell Repository, obituary data, and other public sources, they re-identified nearly 50 individuals from three distinct pedigrees, including both male and female family members. A single reviewer needed only 3-7 hours to identify a complete pedigree.
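
The sketch below gives a rough, hypothetical flavor of the surname-inference step; it is not the study authors’ actual pipeline. A Y-STR profile taken from a “de-identified” genome is compared against a toy genealogy table that pairs haplotypes with surnames, and the resulting surname is then narrowed using the demographic fields (year of birth, state) published alongside the genome. The marker names are real Y-STR loci, but every repeat count, surname, record, and threshold is invented.

```python
# Hypothetical surname-inference sketch (all data and thresholds invented).
# Step 1: compare a Y-STR haplotype from the research genome against a
#         genealogy database that pairs haplotypes with surnames.
query_ystr = {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS392": 13}

genealogy_db = [
    {"surname": "Hypotheticalson",
     "haplotype": {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS392": 13}},
    {"surname": "Exampleton",
     "haplotype": {"DYS19": 15, "DYS390": 23, "DYS391": 10, "DYS392": 13}},
]

def haplotype_similarity(a: dict, b: dict) -> float:
    """Fraction of markers typed in both profiles that have identical repeat counts."""
    shared = [m for m in a if m in b]
    return sum(a[m] == b[m] for m in shared) / len(shared)

surname_guesses = [rec["surname"] for rec in genealogy_db
                   if haplotype_similarity(query_ystr, rec["haplotype"]) >= 0.9]

# Step 2: narrow the surname's bearers with the demographic fields
#         (year of birth, state) released alongside the genome.
public_records = [
    {"name": "J. Hypotheticalson", "surname": "Hypotheticalson",
     "birth_year": 1962, "state": "UT"},
    {"name": "K. Hypotheticalson", "surname": "Hypotheticalson",
     "birth_year": 1990, "state": "CA"},
]
candidates = [p for p in public_records
              if p["surname"] in surname_guesses
              and p["birth_year"] == 1962 and p["state"] == "UT"]
print(candidates)  # plausibly narrowed to a single, now re-identified, person
```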

After this privacy breach, year of birth was removed from the public database and participants in the 1000 Genomes Project were re-consented. Specifically, they were informed of the broad access anticipated for these genomic data and the remaining possibility of re-identification. It is important to note that the re-identification was entirely legal; i.e., it was accomplished within the existing framework of current law and regulations, as well as the policies established by organizations overseeing databases. Further, re-identification similar to that demonstrated with the 1000 Genomes Project has been accomplished with other genomic databases.

Although special characteristics in this situation arguably facilitated re-identification (uncommon surnames, connection to a less densely populated state such as Utah), the message is clear: the privacy of “de-identified” data in genomic databases is at risk, both for individuals and for family members. In addition, the already persistent threats of poor database security and purposeful or inadvertent disclosure by those with access to data are now magnified by the need to share information more widely (among those with the appropriate expertise). As the number of databases and the amount of family pedigree information increase, and as the number of individuals working with data grows, so does the risk. As the Presidential Commission recently noted, “Without … (privacy) assurance in place, individuals are less likely to voluntarily supply the data that have the potential to benefit us all [with] life-saving treatments for genetic diseases.”

A Path Forward

To maintain crucial public trust, organizations need to re-examine existing database policies, continually update technology, and revisit public policy.

Database policies. Transparency, strict governance and rules, and provision of a “floor of privacy protection” are essential to good data stewardship. Therefore, any organization responsible for genomic databases should develop an explicit management structure, data access and use policies, and clearly defined roles. These roles might include functions — access control manager, security manager, etc. — that are analogous to those in the oversight of randomized clinical trials.

As part of this structure, oversight boards should control data access by examining science, privacy, and ethics (with peer review for data requests); they should assure that data use for research is scientifically rigorous and targets appropriate use in care (otherwise, why subject individuals to any risk?). Data use agreements (DUAs) should be used, and data stewards should remain watchful regarding changing technology and possibilities. VA’s Million Veteran Program has adopted a number of policies along these lines.

Overall, transparency and responsible stewardship are the best safeguards in addressing public concerns about data disclosure. Prospective database and biobank donors should be made fully aware of any potential risk to privacy, including residual risk and uncertainty regarding re-identification. It has, in fact, been suggested that privacy and disclosure be posed not as absolutes, but rather as points on a continuum.

Past policies whereby databanks offer markedly different restrictions on identified and “de-identified” data should be reconsidered or modified according to realistic expectations regarding privacy. Mechanisms should be developed to take patient preferences into account, and informed consent should be made clear. Organizations should also provide training on privacy and security to researchers and implement strong protections against unauthorized data access.

Technology. As a fundamental rule, those managing genomic databases should utilize the most advanced data sharing and access techniques. To address the specific problem of re-identification — particularly when data are widely shared — an important approach is the use of “distributed databases.” Here, data in each database remain behind a firewall and information is extracted on a “need-to-know” basis. Database managers provide controlled or “computational access” with strict DUAs; i.e., requests for access must be directed at specific projects.
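
As a toy illustration of what “computational access” to a distributed database can look like, the sketch below lets an approved project submit a counting query that runs at each site; raw records never leave the sites, and counts below a minimum cell size are suppressed before release. The site names, the threshold of 11, and the function shape are all assumptions made for illustration, not a description of any particular system.

```python
# Minimal sketch of a "distributed database" query under computational access.
# Raw records stay behind each site's firewall; only suppressed counts return.
from typing import Callable, Dict, List

MIN_CELL_SIZE = 11  # hypothetical small-cell suppression threshold

# Each site holds its own records (simulated in memory here).
site_a_records = [{"variant": "variant A", "age_band": "40-49"}] * 30
site_b_records = [{"variant": "variant A", "age_band": "40-49"}] * 7

def run_federated_count(sites: Dict[str, List[dict]],
                        predicate: Callable[[dict], bool]) -> Dict[str, int]:
    """Apply an approved counting query at each site and suppress small cells."""
    results = {}
    for site_name, records in sites.items():
        count = sum(1 for r in records if predicate(r))
        results[site_name] = count if count >= MIN_CELL_SIZE else 0  # suppressed
    return results

# A project-specific query of the kind a data use agreement might permit:
approved_query = lambda r: r["variant"] == "variant A" and r["age_band"] == "40-49"
print(run_federated_count({"site_a": site_a_records, "site_b": site_b_records},
                          approved_query))
# {'site_a': 30, 'site_b': 0} -- site B's cell is too small to release
```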

Many sophisticated IT approaches are also being developed to facilitate this distributed model. Going forward, software programs might be used to manage biobank access policies as well as detect hackers. Automatic protective “disclosure filters” or individualized meta-data tags, which allow use according to patient or other preferences, might be applied to allow only specific items of information to pass through. Additionally, novel solutions exist to address specific re-identification issues, such as generalizing diagnostic codes and using statistical approaches to choose identifiers.
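
One simple flavor of the “generalizing diagnostic codes” idea is a disclosure filter that coarsens fields before they pass through. The sketch below truncates an ICD-9 code to its three-digit category, masks the zip code, and rounds the birth year to a decade; the specific rules and field names are simplified assumptions, not a standard.

```python
# Hypothetical disclosure filter: generalize quasi-identifiers before release.
def generalize_record(record: dict) -> dict:
    """Coarsen fields that could act as quasi-identifiers."""
    released = dict(record)
    released["icd9"] = record["icd9"].split(".")[0]              # 250.02 -> 250
    released["zip"] = record["zip"][:3] + "**"                   # 20007  -> 200**
    released["birth_year"] = (record["birth_year"] // 10) * 10   # 1954   -> 1950
    return released

print(generalize_record({"icd9": "250.02", "zip": "20007", "birth_year": 1954}))
# {'icd9': '250', 'zip': '200**', 'birth_year': 1950}
```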

Public policy. The public and professional agenda regarding the level and extent of prescriptive public policies toward databases will be informed by the public’s understanding of the benefits of genomic databases; by the public trust that derives from adherence to fundamental security and transparency principles; and by the professional behavior of database managers.

Much has changed since the original privacy laws and research protections discussed earlier were put into place. As a starting point for addressing existing gaps and bringing policy into alignment with new realities, policymakers should consider the following: Should there be extensions of the GINA law? Should HIPAA safe harbor provisions remain the gold standard in light of the latest re-identification approaches? What are the appropriate penalties and sanctions for breaches of data confidentiality (should they be analogous to those for data falsification)?

How should first-degree relatives be protected? Should public policy go further in mandating structure for databases? Should Certificate of Confidentiality-like approaches be further extended? Another set of issues revolves around the use of genomic data for both research and clinical care, which is not allowed under present rules.

Finding The “Sweet Spot”

Genomic databases hold great promise for improving the health of all and will revolutionize medical research and practice. It is therefore incumbent upon all those who dwell in the genomic “space” to ensure that the public understands the benefits of big data. Locating the proverbial “sweet spot” — between sharing genomic data widely with the appropriate body of experts and guarding the privacy of the altruistic individuals who volunteer their tissue and health information — is the goal. Hopefully, we can continue to build wisdom on this issue in all sectors.




4 Responses to “The Privacy Conundrum And Genomic Research: Re-Identification And Other Concerns”

  1. Ben7 Says:

    It seems like ‘privacy’ slows innovation in every field

  2. Joel Kupersmith Says:

    Many thanks for the comments. I think that clinicians, researchers, organizations and all of us are looking for that sweet spot where patient privacy is maintained and the enormous good that genomics can bring is realized. We also have to be cautious about the consequences of any policy and legal initiatives.

  3. jhnoblejr Says:

    How can one promise privacy in the new era of NSA spying on citizens? Cloud storage is impossible to protect. Any promises of privacy for data collected by researchers are impossible to guarantee. NIH is currently pushing for waiver of consent for comparative effectiveness research when conducting trials involving interventions within the scope of the so-called medical “standard of care.” The push comes from the controversy arising from the SUPPORT neonatal oxygenation experiment wherein the OHRP made an initial determination that the mothers of the premature infants were not properly apprised of the extent of risk related to the high and low oxygenation level arms of the experiment. If NIH has its way one would not know whether one’s physician is operating as an agent of a comparative effectiveness trial or as one’s personal physician providing the now defined titrated “standard of care.” Will genomic researchers, like the NIH neonatologists, appeal to the “future good of society” as the basis for waiving patient privacy? If so, the balance will tilt in the direction of increased risk for any patient-physician encounter. The earlier in life such encounter, the greater the lifetime risks for the patient. Will parents resist exposing their children to physician contact for fear their children may face future discrimination on the basis of the information that the genome researchers acquire and store? Fashioning appropriate policies will require transparency, open public debate, and legislation relating to these important issues. Citizens cannot depend on self-interested researchers and their funding sources to decide.

  4. Daniel Barth-Jones Says:

    Dr. Kupersmith points us to quite important concerns as the number of genetic databases and as online family pedigree information increases. The potential risks associated with Y-STR re-identification methods are only likely to importantly increase over the coming decades. It’s clearly time for public policy-makers to think deeply about re-identification risks associated with genomic data (not only for this particular recent Y-STR example of genomic re-identification, but for other methods that are sure to be devised in the future) and act to design appropriate protections which will augment imperfect technical safeguards with measures that make such re-identifications socially, legally, and economically unacceptable (http://www.ncbi.nlm.nih.gov/pubmed/23449577).

    Given the inherent extremely large combinatorics of genomic data and the intrinsic biological and social network characteristics that determine how genomic traits (and surnames) are shared with both our ancestors and descendants through genealogic lines, the issues surrounding the degree to which such information can be effectively “de-identified” are non-trivial (http://blogs.law.harvard.edu/billofhealth/2013/05/29/public-policy-considerations-for-recent-re-identification-demonstration-attacks-on-genomic-data-sets-part-1-re-identification-symposium/). While the elimination of the age and location information associated with the re-identified genomic information could have helped to thwart this Y-STR surname inference attack recently reported in Science, the concerns raised by this recent re-identification demonstration call out for enhanced public policy to anticipate re-identification concerns and appropriately protect genetic privacy.

    Among these protections should be strong prohibitions on re-identification, or attempted re-identification, of individuals and their relatives, family or household members. Congress should consider establishing civil and criminal penalties for unauthorized re-identification of de-identified data and for HIPAA Limited Data Sets. Robert Gellman, a privacy and information policy expert, has proposed a well-conceived voluntary legislative-based contractual solution that, with some appropriate modifications and enhancements, could serve as a suitable foundation for such legislative efforts. For example, a carefully designed prohibition on re-identification attempts could still allow research involving re-identification of specific individuals to be conducted under the approval of Institutional Review Boards (IRBs), but would ban such re-identification attempts conducted without essential human subjects research protections. (http://ir.lawnet.fordham.edu/cgi/viewcontent.cgi?article=1277&context=iplj)
