No matter what the arena — finance, health care, or national security — questions surrounding the provision of personal data are always the same:  how much benefit vs. how much risk?  Who handles these data, and can those individuals be trusted?  How do organizations guard against data misuse?   What are the legal safeguards to protect privacy, and are they sufficient in an era when more data are shared more widely?

Nowhere is the privacy discussion more personal than in genomics, the very hardwiring of our existence.  Genomic data are unique to individuals (or identical twins) and, except for occasional mutations, do not change over a lifetime, thereby rendering disclosures permanent.  Genomic data also have special properties regarding privacy, especially as comprehensive whole genome sequencing becomes the major technique.

The benefits of amassing genomic data in sufficient case numbers for validity and making this knowledge available to an appropriately wide body of expert investigators are extensive. Research derived from genomic databases offers potentially large health payoffs. Genomics can help scientists predict who will develop a disease (e.g., Huntington’s disease) and tailor treatments. It also holds the potential to bring about a paradigm shift in how we think about and classify disease; i.e., allowing us to move from the pathology-based approach begun in the late 19th century, which focuses on the progression of disease in a specific organ, to a biochemical- and genomics-based approach. This new approach is already being applied to a number of diseases, including certain cancers.

Yet the damage caused by the misuse of genomic data can be irreparable.  Disclosure of genomic information can have stigmatizing consequences for both the individual and family, particularly first-degree relatives.  These consequences include employment discrimination, denial of life insurance, and inappropriate marketing.  And that is the conundrum: Realizing the promise of genomics, while safeguarding genomic data and protecting individual privacy.

Over the past 40 years, federal and state policymakers have addressed various pieces of the privacy puzzle in laws and rules, many of which relate to genomic research. Among the more notable are the 1974 Privacy Act, which protects systems of federal government records; Certificates of Confidentiality statutes (established for AHRQ and NIH investigators), which restrict data use outside of research and protect research subjects from forced disclosure for various proceedings; the Common Rule, which in 1991 mandated that Institutional Review Boards oversee research privacy and security; HIPAA, which was passed five years later and protects privacy of clinical information via the “safe harbor” of 18 specific identifiers (research databases are given leeway to not include some identifiers); and, more recently, the Genetic Information Nondiscrimination Act of 2008 (GINA), which protects individuals from genetic discrimination via denial of employment or health insurance coverage.

Additional activity at the federal level includes the Presidential Commission for the Study of Bioethical Issues (“Presidential Commission”) and its recent report, “Privacy and Progress in Whole Genome Sequencing,” among other initiatives. There is also a patchwork of privacy policies and laws in about half the states.

Until now, anonymization of data, database security, and these protective laws have been the primary safeguards against security lapses in genomic data made partially or largely available to the public. However, various factors, including the formation of very large databases and data sharing and access by large numbers of individuals, have put new strains on genomic security.

The Problem Of Re-Identification

In addition, and most importantly, the recent availability of large amounts of data in various public and non-public registries, coupled with advanced approaches to data matching, has resulted in a new form of data hacking. Under the right circumstances, a persistent individual can overcome barriers to identification and “re-identify” an individual in an anonymized database. The ability to do so, generally speaking, requires access to at least two databases: a “candidate” database, which holds the anonymized data, and one or more “reference” databases, which share elements (e.g., zip codes) in common with it.

Re-identification of an individual or small group is essentially achieved by using “quasi-identifiers” to cross-reference specific elements that are included along with the genetic data but are also found in other databases; i.e., elements that can directly or indirectly narrow down a given group or subgroup until the subject can be recognized.

For example, a zip code covers many individuals and does not directly identify any one subject. However, when zip codes are cross-matched with information from other databases, such as driver and vehicle records, the number of candidate individuals is whittled down. Further cross-matching with, for example, voter registration files can whittle the number down even further, until a particular individual is identified. While this re-identification problem itself is not new, the prospect of re-identifying genomic data has elevated concerns about privacy.
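The narrowing effect of cross-matching can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the records are fictional, the field names (`zip`, `birth_year`, `sex`) are hypothetical quasi-identifiers, and the `link` helper is not drawn from any real re-identification tool.

```python
from collections import defaultdict

# Toy, entirely fictional records. All field names and values are assumptions.
# "Candidate" database: names removed, quasi-identifiers retained.
candidate = [
    {"record_id": "G-001", "zip": "84101", "birth_year": 1950, "sex": "M"},
    {"record_id": "G-002", "zip": "84101", "birth_year": 1962, "sex": "F"},
]
# "Reference" database: a public registry that lists names with the same fields.
reference = [
    {"name": "Person A", "zip": "84101", "birth_year": 1950, "sex": "M"},
    {"name": "Person B", "zip": "84101", "birth_year": 1950, "sex": "F"},
    {"name": "Person C", "zip": "84101", "birth_year": 1962, "sex": "F"},
    {"name": "Person D", "zip": "84101", "birth_year": 1962, "sex": "F"},
]


def link(candidate_rows, reference_rows, keys):
    """Map each candidate record to the reference names sharing its values for `keys`."""
    index = defaultdict(list)
    for row in reference_rows:
        index[tuple(row[k] for k in keys)].append(row["name"])
    return {
        row["record_id"]: index.get(tuple(row[k] for k in keys), [])
        for row in candidate_rows
    }


# Each added quasi-identifier shrinks the set of people a record could belong to.
by_zip = link(candidate, reference, ["zip"])
by_zip_year = link(candidate, reference, ["zip", "birth_year"])
by_all = link(candidate, reference, ["zip", "birth_year", "sex"])

for rid in ("G-001", "G-002"):
    print(rid, len(by_zip[rid]), len(by_zip_year[rid]), len(by_all[rid]))
```

In this toy data, one record resolves to a single name once all three fields are matched, while the other remains ambiguous between two people. Real reference databases differ in coverage and accuracy, so actual results would depend heavily on the data at hand.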

Such concern is justifiable, particularly given the variety of available reference databases: hospital quality data; ICD-9 codes (powerful identifiers); death master files (from which social security numbers can be adduced); social security databases; vehicular databases; voter registration lists; house sales; U.S. census data; public records search engines; and many others. With regard to genomics databases specifically, private genetic and genealogy databases and even listings of relatives in obituaries can be used as reference databases for matching. The more specific the fields in a reference database are, the better they serve as quasi-identifiers.

Because genes have specifically identifiable common elements (haplotypes, mutations) and are also shared with blood relatives, they offer distinct possibilities for re-identification. For example, a recent study showed how individuals could be re-identified in what was previously considered an “anonymous” database, that of the 1000 Genomes Project. In addition to genomic data, this database included year of birth and state of residence (neither of which is protected by HIPAA from disclosure).

To re-identify individuals, researchers first profiled short tandem repeats on the Y (male) chromosome, called Y-STR markers, in the genomes of 32 individuals of European ancestry. These markers distinguish male lines in families. Next, they plugged the Y-STR information into two publicly available recreational genealogy databases, giving them access to about 40,000 surnames and pedigrees. They found eight matches to Mormon families in Utah. Then, using information from the Coriell Cell Repository, obituary data, and other public sources, they re-identified nearly 50 individuals from three distinct pedigrees, including both male and female family members. A single reviewer needed only 3 to 7 hours to identify a complete pedigree.

After this privacy breach, year of birth was removed from the public database and participants in the 1000 Genomes Project were re-consented. Specifically, they were informed of the broad access anticipated for these genomic data and the remaining possibility of re-identification. It is important to note that the re-identification was entirely legal; i.e., it was accomplished within the existing framework of current law and regulations as well as policies established by organizations overseeing databases. Further, re-identification similar to that in the 1000 Genomes Project has been accomplished in other genomic databases.

Although special characteristics in this situation arguably facilitated re-identification (uncommon surnames, connection to a less densely populated state such as Utah), the message is clear: the privacy of “de-identified” data in genomic databases is at risk, both for individuals and for their family members. In addition, the already persistent threats of poor database security and purposeful or inadvertent disclosure by those with access to data are now magnified by the need to share information more widely (among those with the appropriate expertise). As the number of databases and the amount of family pedigree information increase, and the number of individuals working with data also grows, so does the risk. As the Presidential Commission recently noted, “Without … (privacy) assurance in place, individuals are less likely to voluntarily supply the data that have the potential to benefit us all [with] life-saving treatments for genetic diseases.”

A Path Forward

To maintain crucial public trust, organizations need to re-examine existing database policies, technology must be continually updated, and public policy must be revisited.

Database policies. Transparency, strict governance and rules, and provision of a “floor of privacy protection” are essential to good data stewardship. Therefore, any organization responsible for genomic databases should develop an explicit management structure, data access and use policies, and clearly defined roles. These roles might include functions (access control manager, security manager, etc.) that are analogous to those used in the oversight of randomized clinical trials.

As part of this structure, oversight boards should control data access by examining science, privacy, and ethics (with peer review for data requests); they should assure that data use for research is scientifically rigorous and targets appropriate use in care (otherwise, why subject individuals to any risk?). Data use agreements (DUAs) should be used, and data stewards should remain watchful regarding changing technology and possibilities. VA’s Million Veteran Program has adopted a number of policies along these lines.

Overall, transparency and responsible stewardship are the best safeguards in addressing public concerns about data disclosure. Prospective database and biobank donors should be made fully aware of any potential risk to privacy, including residual risk and uncertainty regarding re-identification. It has, in fact, been suggested that privacy and disclosure be posed not as absolutes, but rather as points on a continuum.

Past policies whereby databanks offer markedly different restrictions on identified and “de-identified” data should be reconsidered or modified according to realistic expectations regarding privacy. Mechanisms should be developed to take patient preferences into account, and informed consent should be made clear. Organizations should also provide training on privacy and security to researchers and implement strong protections against unauthorized data access.

Technology. As a fundamental rule, those managing genomic databases should use the most advanced data sharing and access techniques. To address the specific problem of re-identification, particularly when data are widely shared, an important approach is the use of “distributed databases.” Here, data in each database remain behind a firewall and information is extracted on a “need-to-know” basis. Database managers provide controlled or “computational access” with strict DUAs; i.e., requests for access must be directed at specific projects.
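One possible reading of “computational access” in a distributed setting is sketched below: each site answers an approved query with an aggregate count only, and row-level data never leave the site. The `Site` class, the `federated_count` helper, and the suppression threshold `MIN_CELL` are all hypothetical and are not drawn from any specific biobank system.

```python
MIN_CELL = 5  # assumed small-count suppression threshold; real policies set their own


class Site:
    """One database behind its own firewall; rows are never returned to callers."""

    def __init__(self, name, rows):
        self.name = name
        self._rows = rows

    def count(self, predicate):
        """Return how many rows match, or None if the count is too small to release."""
        n = sum(1 for row in self._rows if predicate(row))
        return n if n >= MIN_CELL else None


def federated_count(sites, predicate):
    """Sum the counts that sites agree to release and list the sites that suppressed theirs."""
    results = {site.name: site.count(predicate) for site in sites}
    released = [n for n in results.values() if n is not None]
    suppressed = [name for name, n in results.items() if n is None]
    return sum(released), suppressed


# Fictional data: a boolean flag for carrying some variant of interest.
site_a = Site("site_a", [{"variant": True}] * 6 + [{"variant": False}] * 4)
site_b = Site("site_b", [{"variant": True}] * 2 + [{"variant": False}] * 6)

total, suppressed = federated_count([site_a, site_b], lambda row: row["variant"])
print(total, suppressed)
```

Because suppressed sites contribute nothing, the total here is a lower bound rather than an exact count. That loss of precision is the price of releasing nothing about very small groups, which are the easiest to re-identify.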

Many sophisticated IT tools are also being developed to support this approach. Going forward, software programs might be used to manage biobank access policies as well as detect hackers. Automatic protective “disclosure filters” or individualized meta-data tags, which allow use according to patient or other preferences, might be applied to allow only specific items of information to pass through. Additionally, novel solutions exist to address specific re-identification issues, such as generalizing diagnostic codes and using statistical approaches to choose identifiers.
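Two of these ideas, generalizing diagnostic codes and filtering disclosure by preference tags, can be sketched in a few lines. This is only an illustration under stated assumptions: the record layout, the `consent` tag structure, the purpose labels, and both function names are hypothetical, and the genotype value is a placeholder.

```python
def generalize_icd9(code):
    """Coarsen an ICD-9-style code by keeping only the part before the decimal point."""
    return code.split(".")[0]


def disclosure_filter(record, purpose):
    """Release only the fields whose consent tag permits the stated purpose."""
    tags = record["consent"]  # assumed layout: field name -> set of permitted purposes
    return {
        field: value
        for field, value in record["data"].items()
        if purpose in tags.get(field, set())
    }


# Fictional record; fields without a consent tag are never released.
record = {
    "data": {
        "diagnosis": generalize_icd9("250.01"),
        "zip": "84101",
        "genotype": "placeholder-genotype",
    },
    "consent": {
        "diagnosis": {"research", "clinical"},
        "genotype": {"research"},
    },
}

for_research = disclosure_filter(record, "research")
for_clinical = disclosure_filter(record, "clinical")
print(for_research)
print(for_clinical)
```

Defaulting to “not released” for any untagged field is a deliberate choice in this sketch: a missing preference is treated as a refusal rather than as permission.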

Public policy. The public and professional agenda on how prescriptive public policy toward databases should be will be informed by the public’s level of understanding of the benefits of genomic databases; the public trust that derives from adherence to fundamental security and transparency principles; and the professional behavior of database managers.

Much has changed since the original privacy laws and research protections discussed earlier were put into place. As a starting point for addressing existing gaps and bringing policy into alignment with new realities, policymakers should consider the following: Should there be extensions of the GINA law? Should HIPAA safe harbor provisions be the gold standard in light of the latest re-identification approaches? What are the appropriate penalties and sanctions for breaches of data confidentiality (should they be analogous to those for data falsification)?

How should first-degree relatives be protected? Should public policy go further in mandating structure for databases? Should Certificate of Confidentiality-like approaches be further extended? Another set of issues revolves around use of genomic data for both research and clinical care, which is not allowed under present rules.

Finding The “Sweet Spot”

Genomic databases hold great promise for improving the health of all and will revolutionize medical research and practice.  It is therefore incumbent upon all those who dwell in the genomic “space” to ensure that the public understands the benefits of big data.  Locating the proverbial “sweet spot” — between sharing genomic data widely with the appropriate body of experts and guarding the privacy of altruistic individuals who volunteer their tissue and health information — is the goal.  Hopefully, we can arrive at more and more wisdom on this issue in all sectors.