Routes for breaching and protecting genetic privacy
Abstract
We are entering an era of ubiquitous genetic information for research, clinical care and personal curiosity. Sharing these datasets is vital for progress in biomedical research. However, one growing concern is the ability to protect the genetic privacy of the data originators. Here, we present an overview of genetic privacy breaching strategies. We outline the principles of each technique, point to the underlying assumptions, and assess its technological complexity and maturation. We then review potential mitigation methods for privacy-preserving dissemination of sensitive data and highlight different cases that are relevant to genetic applications.
Introduction
We produce genetic information for research, clinical care and out of personal curiosity at exponential rates. Sequencing studies including thousands of individuals have become a reality1,2, and new projects aim to sequence hundreds of thousands to millions of individuals3. Some geneticists envision whole genome sequencing of every person as part of routine health care4,5.
Sharing genetic findings is vital for accelerating the pace of biomedical discoveries and fully realizing the promises of the genetic revolution6. Recent studies suggest that robust predictions of genetic predispositions to complex traits from genetic data will require the analysis of millions of samples7,8. Clearly, collecting cohorts at such scales is typically beyond the reach of individual investigators and cannot be achieved without combining different sources. In addition, broad dissemination of genetic data promotes serendipitous discoveries through secondary analysis, which is necessary to maximize its utility for patients and the general public9.
One of the key issues of broad dissemination is striking an adequate balance with data privacy10. Prospective participants of scientific studies have ranked privacy of sensitive information as one of their top concerns and a major determinant of participation in a study11–13. Recently, public concerns regarding medical data privacy halted a massive plan of the National Health Service in the UK to create a centralized health-care database14. In addition, protecting personal identifiable information is a demand of an array of regulatory statutes in the USA and in the European Union15. Data de-identification, the removal of personal identifiers, has been suggested as a potential path to reconcile data sharing and privacy demands16. But is this approach technically feasible for genetic data?
This review categorizes privacy-breaching techniques that are relevant to genetic information and maps potential counter-measures. We first categorize privacy-breaching strategies (Figure 1), discuss their underlying technical concepts, and evaluate their performance and limitations (Table 1). Then, we present privacy-preserving technologies, group them according to their methodological approaches, and discuss their relevance to genetic information. As a general theme, we focus only on breaching techniques that involve data mining and the fusion of distinct resources to gain private information relevant to DNA data. Data custodians should be aware that security threats can be much broader. They include cracking weak database passwords, classic techniques for hacking the server that holds the data, theft of storage devices due to poor physical security, and intentional misconduct by data custodians17–19. We do not include these threats because they have been extensively discussed in the computer security field20. In addition, this review does not cover the potential implications of loss of privacy, which heavily depend on cultural, legal and socio-economic context and have been covered in part by the broad privacy literature21,22.
Table 1

| Technique | Maturation level and technical complexity | Example of auxiliary information | Availability of auxiliary information | Example of a reference |
|---|---|---|---|---|
| **Identity tracing** | | | | |
| Surname inference | ●●● | Records of Y-chromosome haplotypes and surnames | Intermediate-Good | 35 |
| DNA phenotyping | ●● | Population registry of eye colour | Poor | 55 |
| Demographic identifiers | ● | Population registry stratified by state | Good | 29 |
| Pedigree structure | ●● | Family trees of the entire population | Poor | 31 |
| Side-channel leakage | ●●● | – | Varies | 26 |
| **Attribute disclosure attacks via DNA (ADAD)** | | | | |
| N=1 | ●● | n/a | n/a | 61 |
| Genotype frequencies | ●●● | Exome Sequencing Project | Good | 63 |
| Linkage disequilibrium | ●●●● | 1000 Genomes | Intermediate | 67 |
| Effect sizes | ●●● | n/a | n/a | 68 |
| Trait inference | ●● | n/a | n/a | 69 |
| Gene expression | ●●●● | GTEx project | Poor | 76 |
| **Completion attacks** | | | | |
| Imputation of a masked marker | ●● | 1000 Genomes | Good | 78 |
| Genealogical imputation (single relative) | ●● | OpenSNP and Facebook profiles | Poor | 79 |
| Genealogical imputation (multiple relatives) | ●●●● | deCODE pedigree and DNA | Poor | 80 |
Maturation level and technical complexity: rated with filled circles (●); more circles indicate, respectively, a more mature technique and a more technically demanding attack.
Availability of auxiliary information: this column refers to the level of existing public reference databases for the US population. For identity tracing, it refers to the availability of organized lists that link identities to the relevant pieces of information. For ADAD and completion techniques, it refers to the existence of supporting reference datasets that are necessary to complete the attack. Poor – supporting data is highly fragmented and not amenable to searches. Intermediate – supporting data is harmonized and searchable but requires some pre-processing. Good – supporting data is searchable using existing tools or with minimal pre-processing.
Identity Tracing attacks
The goal of identity tracing attacks is to uniquely identify an anonymous DNA sample using quasi-identifiers – residual pieces of information that are embedded in the dataset. The success of the attack depends on the information content that the adversary can obtain from these quasi-identifiers relative to the size of the base population (Box 1).
Searching with meta-data
Genetic datasets are typically published with additional metadata, such as basic demographic details, inclusion and exclusion criteria, pedigree structure, as well as health conditions that are critical to the study and for secondary analysis. These pieces of metadata can be exploited to trace the identity of the unknown genome.
Unrestricted demographic information conveys substantial power for identity tracing. It has been estimated that the combination of date of birth, sex, and 5-digit zip code uniquely identifies more than 60% of US individuals23,24. In addition, there are extensive public resources with broad population coverage and search interfaces that link demographic quasi-identifiers to individuals, including voter registries, public record search engines (such as PeopleFinders.com) and social media. An initial study reported the successful tracing of the medical record of the Governor of Massachusetts using demographic identifiers in hospital discharge information25. Another study reported the identification of 30% of Personal Genome Project (PGP) participants by demographic profiling that included zip code and exact birthdates found in PGP profiles26.
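To get a feel for why this combination is so identifying, one can compare the number of possible (date of birth, sex, ZIP code) bins with the size of the US population; if there are far more bins than people, most occupied bins hold a single person. The sketch below performs this back-of-the-envelope calculation; all counts are rough illustrative assumptions rather than census figures.

```python
import math

# Rough uniqueness estimate for the quasi-identifier combination
# (date of birth, sex, 5-digit ZIP code). All counts below are
# ballpark assumptions for illustration, not census data.
us_population = 330e6
birth_dates = 365 * 80        # ~80 plausible birth-year cohorts
sexes = 2
zip_codes = 40_000            # approximate number of 5-digit ZIP codes

bins = birth_dates * sexes * zip_codes
bits = math.log2(bins)
people_per_bin = us_population / bins

print(f"{bins:.1e} bins ≈ {bits:.1f} bits of identifying information")
print(f"expected people per bin: {people_per_bin:.2f}")
# ~0.14 people per bin on average: most occupied bins contain exactly
# one person, consistent with the >60% uniqueness estimate above.
```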
Since the inception of the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule, dissemination of demographic identifiers has been subject to tight regulation in the US health-care system27. The Safe Harbor provision requires that the maximal resolution of any date field, such as hospital admission dates, is in years. In addition, the maximal resolution of a geographical subdivision is the first three digits of a zip code (for zip codes covering populations greater than 20,000). Statistical analyses of census data and empirical health records have found that the Safe Harbor provision provides reasonable immunity against identity tracing, assuming that the adversary has access only to demographic identifiers. The combination of sex, age, ethnic group, and state is unique in less than 0.25% of the population of each state28,29.
Pedigree structures are another piece of metadata that are included in many genetic studies. These structures contain rich information, especially when large kinships are available30. A systematic study analysed the distribution of 2,500 two-generation family pedigrees that were sampled from obituaries from a US town of 60,000 individuals31. Only the number (but not the order) of male and female individuals in each generation was available. Despite this limited information, about 30% of the pedigree structures were unique, demonstrating the large information content that can be obtained from such data.
Another vulnerability of pedigrees is the ability to combine demographic quasi-identifiers across records to boost identity tracing despite HIPAA protections. For example, consider a large pedigree that states the age and state of residence of all participants. The age and state of each participant individually leaks minimal information, but knowing the ages of all first- and second-degree relatives of an individual dramatically reduces the search space. Moreover, once a single individual in a pedigree is identified, it is easy to link the identities of other relatives to their genetic datasets. The main limitation of identity tracing using pedigree structures alone is their low searchability. Family trees of most individuals are not publicly available, and their analysis requires indexing a large spectrum of genealogical websites. One notable exception is Israel, where the entire population registry was leaked to the web in 2006, allowing the construction of multi-generation family trees of all Israeli citizens32.
Identity tracing by genealogical triangulation
Genetic genealogy attracts millions of individuals interested in their ancestry or in discovering distant relatives33. To that end, the community has developed impressive online platforms to search for genetic matches, which can be exploited by identity tracers. One potential route of identity tracing is surname inference from Y-chromosome data34,35 (Figure 2). In most societies, surnames are passed from father to son, creating a transient correlation with specific Y-chromosome haplotypes36,37. The adversary can take advantage of the Y chromosome–surname correlation and compare the Y haplotype of the unknown genome to haplotype records in recreational genetic genealogy databases. A close match with a relatively short time to the most recent common ancestor (MRCA) would signal that the unknown genome likely has the same surname as the record in the database.
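The core matching step can be sketched in a few lines. The marker names below are real Y-STR loci, but the database entries, the unknown haplotype and the distance threshold are invented for illustration; real attacks compare tens of markers and use mutation models to estimate the time to the MRCA rather than a raw repeat-count distance.

```python
# Toy surname inference by Y-STR haplotype matching. Marker names are
# real Y-STR loci; all haplotypes and the threshold are fabricated.
MARKERS = ["DYS19", "DYS390", "DYS391", "DYS392", "DYS393"]

genealogy_db = [  # (surname, repeat counts), as on a genealogy website
    ("Miller", {"DYS19": 14, "DYS390": 24, "DYS391": 11,
                "DYS392": 13, "DYS393": 13}),
    ("Garcia", {"DYS19": 15, "DYS390": 23, "DYS391": 10,
                "DYS392": 11, "DYS393": 12}),
]

def genetic_distance(h1, h2):
    """Total repeat-count difference across markers: a crude stand-in
    for mutation-model-based estimates of the time to the MRCA."""
    return sum(abs(h1[m] - h2[m]) for m in MARKERS)

def infer_surname(unknown, max_distance=2):
    distance, surname = min(
        (genetic_distance(unknown, hap), name) for name, hap in genealogy_db)
    return surname if distance <= max_distance else None

unknown = {"DYS19": 14, "DYS390": 24, "DYS391": 11,
           "DYS392": 13, "DYS393": 14}
print(infer_surname(unknown))  # -> 'Miller' (distance 1: a close match)
```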
The power of surname inference stems from exploiting information from distant patrilineal relatives of the person behind the unknown genome. Empirical analysis estimated that 10–14% of US white male individuals from the middle and upper classes are subject to surname inference based on scanning the two largest Y-chromosome genealogical websites with a built-in search engine35. Individual surnames are relatively rare in the population, and in most cases a single surname is shared by fewer than 40,000 US male individuals35, which is equivalent to 13 bits of information (Box 1). In terms of identification, successful surname recovery is nearly as powerful as finding one’s zip code. Another feature of surname inference is that surnames are highly searchable. From public record search engines to social networks, numerous online resources offer query interfaces that generate a list of individuals with a specific surname. Surname inference has been utilized to breach genetic privacy in the past38–41. Several sperm-donor-conceived individuals and adoptees successfully used this technique on their own DNA to trace their biological families. In the context of research samples, a recent study reported five successful surname inferences from Illumina datasets of three large families that were part of the 1000 Genomes Project, which eventually exposed the identity of nearly fifty research participants35.
The main limitation of surname inference is that haplotype matching relies on Y-chromosome short tandem repeats (Y-STRs). Currently, most sequencing studies do not routinely report these markers, and the adversary would have to process large-scale raw sequencing files with a specialized tool42. Another complication is false surname hits and the inference of spelling variants of the original surname. Eliminating incorrect surname hits necessitates access to additional quasi-identifiers, such as pedigree structure, and typically requires a few hours of manual work. Finally, in certain societies, a surname is not a strong identifier, and its inference does not provide the same power for re-identification as in the USA. For example, 400 million people in China hold one of the ten most common surnames36, and the top hundred surnames cover almost 90% of the population43, dramatically reducing the utility of surname inference for re-identification.
An open research question is the utility of non-Y-chromosome markers for genealogical triangulation. Websites such as Mitosearch.org and GedMatch.com run open searchable databases for matching mitochondrial and autosomal genotypes, respectively. Our expectation is that mitochondrial data will not be very informative for tracing identities. The resolution of mitochondrial searches is low due to the small size of the mitochondrial genome, meaning that a large number of individuals share the same mitochondrial haplotypes. In addition, matrilineal identifiers such as surname or clan are relatively rare in most human societies, complicating the use of mitochondrial haplotypes for identity tracing. Autosomal searches, on the other hand, can be quite powerful. Genetic genealogy companies have started to market services based on dense genome-wide arrays that enable the identification of distant relatives (on the order of 3rd to 4th cousins) with fairly good accuracy44. Such hits would reduce the search space to no more than a few thousand individuals45. The main challenge of this approach is deriving a list of potential people from a genealogical match. As we stated earlier, family trees of most individuals are not publicly available, making such searches a very demanding task that would require indexing a large spectrum of genealogical websites. With the growing interest in genealogy, this technique might become easier in the future and should be taken into consideration.
Identity tracing by phenotypic prediction
Several reports on genetic privacy have envisioned that predictions of visible phenotypes from genetic data could serve as quasi-identifiers for identity tracing46,47. Twin studies have estimated high heritabilities for various visible traits such as height48 and facial morphology49. In addition, recent studies show that age prediction is possible from DNA specimens derived from blood samples50,51. But the applicability of these DNA-derived quasi-identifiers for identity tracing has yet to be demonstrated.
The major limitation of phenotypic prediction is the fast decay of identification power with small inference errors (Box 1). Current genetic knowledge explains only a small proportion of the phenotypic variability of most visible traits, such as height52, body mass index (BMI)53 and face morphology54, substantially limiting their utility for identification. For example, perfect knowledge of height at one-centimetre resolution conveys 5 bits of information. However, with current genetic knowledge that explains 10% of height variability52, the adversary learns only 0.15 bits of information. Predictions of face morphology and BMI are much worse8,54. The exceptions among visible traits are eye colour55 and age prediction50, for which recent studies report predictions that explain 75–90% of the phenotypic variability. But even these successes translate to no more than 3–4 bits of information. Another challenge for phenotypic prediction is the low searchability of some of these traits. We are not aware of population-wide registries of height, eye colour or face morphology that are publicly accessible and searchable. However, future developments in social media might circumvent this barrier.
Identity tracing by side-channel leaks
Side-channel attacks exploit quasi-identifiers that are unintentionally encoded in the database building blocks and structure rather than in the actual data that is meant to be public. A good example of such a leak is the exposure of the full names of PGP participants from filenames in the database26. The PGP allowed participants to upload 23andMe genotyping files to their public profile webpages. These files seemed to contain no explicit identifiers, but decompressing a downloaded 23andMe file restored its original filename, which by default is the first and last name of the user. Because most users did not change the default naming convention, it was possible to trace the identity of a large number of PGP profiles. The PGP now instructs participants to rename files before uploading and warns them that a file may contain hidden information that can expose their identities. More generally, certain types of files, such as Microsoft Office documents, can embed deleted text or hidden identifiers56. Data custodians should be aware that merely scanning file content might not be sufficient to ensure that all identifiers have been removed.
The mechanism used to generate database accession numbers can also leak personal information. For example, in a prominent medical data-mining contest, the accession numbers revealed the disease status of the patients, which was the aim of the contest57. In addition, pattern analysis of a large amount of public data revealed temporal and spatial commonalities in the assignment system that allowed prediction of US social security numbers (SSNs) from quasi-identifiers58. Some have suggested assigning accession numbers by applying cryptographic hashing to participant identifiers, such as names or SSNs59. However, this technique is extremely vulnerable to dictionary attacks due to the relatively small search space of the input. In general, it is advisable to add some sort of randomization to procedures that generate accession numbers.
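The weakness of hash-derived accession numbers can be demonstrated in a few lines. The sketch below hashes a fake SSN into an accession number and then recovers it by enumerating candidates; the truncation length and the deliberately narrowed search space are illustrative choices, and a real attack would simply enumerate all 10^9 possible SSNs.

```python
import hashlib

# Why hashing identifiers is not anonymization: the input space of an
# SSN is at most 10^9 values, so an adversary can hash every candidate
# and invert the "de-identified" accession number by lookup.

def accession_from_ssn(ssn: str) -> str:
    return hashlib.sha256(ssn.encode()).hexdigest()[:16]

published = accession_from_ssn("123-45-6789")   # fake SSN, as released

# Dictionary attack: enumerate candidates and compare hashes. Only one
# field is searched here so the loop runs instantly; enumerating the
# full SSN space is equally feasible on commodity hardware.
for area in range(1000):
    candidate = f"{area:03d}-45-6789"
    if accession_from_ssn(candidate) == published:
        print("recovered:", candidate)
        break
# Adding randomization (or a secret key, as in a keyed hash) removes
# the vulnerability because candidates can no longer be enumerated.
```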
Attribute disclosure attacks via DNA (ADAD)
Consider the following scenario: Alice interviews Bob for a certain position. After the interview, Alice recovers Bob’s DNA and uses it to search a large genetic study of drug abuse. The study stores the DNA in anonymized form, but a match between Bob’s DNA and one of the records reveals that Bob was a drug abuser. While this short story has some practical limitations, it illustrates the main concepts of an ADAD attack: the adversary gains access to a DNA sample of the target, uses the identified DNA to search genetic databases with sensitive attributes (for example, drug abuse), and a match between the identified DNA and a database record links the person to the attribute.
The n=1 scenario
The simplest scenario of ADAD is when the sensitive attribute is directly associated with the genotype data of the individual. The adversary simply matches the genotype data that is associated with the identity of the individual against the genotype data that is associated with the attribute. Such an attack requires only a small number of autosomal single nucleotide polymorphisms (SNPs). Empirical data showed that a carefully chosen set of 45 SNPs is sufficient to provide matches with a type I error of 10−15 for most of the major populations across the globe60. Moreover, random subsets of ~300 common SNPs yield sufficient information to uniquely identify any person61. As such, an individual’s genome is a strong identifier. In general, ADAD is a theoretical vulnerability of virtually any individual-level DNA-derived omics dataset, such as RNA-seq and personal proteomics.
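A minimal calculation shows why so few SNPs suffice. Assuming Hardy–Weinberg equilibrium and independent SNPs, the probability that two unrelated individuals share a genotype at one SNP is the sum of squared genotype frequencies, and the joint random-match probability multiplies across SNPs; the frequencies below are simulated rather than taken from the panel of ref. 60.

```python
import math, random

def genotype_freqs(p):
    """Hardy-Weinberg genotype frequencies (AA, Aa, aa) for allele
    frequency p."""
    q = 1 - p
    return [q * q, 2 * p * q, p * p]

def match_prob_one_snp(p):
    # Probability that two unrelated people share a genotype here.
    return sum(f * f for f in genotype_freqs(p))

random.seed(0)
panel = [random.uniform(0.3, 0.5) for _ in range(45)]  # informative SNPs
log10_match = sum(math.log10(match_prob_one_snp(p)) for p in panel)
print(f"random-match probability ≈ 10^{log10_match:.0f}")
# Roughly 10^-19 for 45 common SNPs, below the 10^-15 figure cited above.
```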
Genome-wide association studies (GWAS) are highly vulnerable to ADAD. In order to address this issue, several organizations, including the NIH, have adopted a two-tier access system for GWAS datasets: a restricted-access area that stores individual-level genotypes and phenotypes, and a public-access area for high-level summary statistics, such as the allele frequencies of all cases and controls62. The premise of this distinction was that summary statistics enable secondary data usage for GWAS meta-analyses while being, it was thought, immune to ADAD.
The summary statistic scenario
A landmark study in 2008 reported the possibility of ADAD on GWAS datasets that consist only of the allele frequencies of the study participants63. The underlying concept of this approach is that, with the target’s genotypes in the case group, the allele frequencies will be positively biased towards the target’s genotypes compared with the allele frequencies of the general population. To illustrate, consider an extremely rare variant in the target’s genome. A non-zero allele frequency of this variant in a small-scale study increases the likelihood that the target was part of the study, whereas a zero allele frequency strongly reduces this likelihood. By integrating the slight biases in the allele frequencies over a large number of SNPs, it is also possible to conduct ADAD with the common variants that are analysed in GWAS.
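The sketch below simulates this membership test in the spirit of the 2008 study63: for every SNP, it compares the target's allele dosage with the reference population frequency and with the published case-group frequency, and sums the differences of absolute deviations. The cohort size, SNP count and frequencies are all simulated, and the statistic is a simplified form rather than the exact published one.

```python
import numpy as np

rng = np.random.default_rng(1)
n_snps, n_cases = 50_000, 1_000
pop_freq = rng.uniform(0.05, 0.95, n_snps)     # reference panel
target = rng.binomial(2, pop_freq) / 2         # target's allele dosages

# Published case-group frequencies, with the target as one of n_cases:
others = rng.binomial(2 * (n_cases - 1), pop_freq) / (2 * (n_cases - 1))
case_freq = ((n_cases - 1) * others + target) / n_cases

def D(y, mix, pop):
    """Positive values mean y sits closer to the mixture (case group)
    than to the reference population: evidence of membership."""
    return float(np.sum(np.abs(y - pop) - np.abs(y - mix)))

print("target in study:  D =", round(D(target, case_freq, pop_freq), 1))
print("unrelated person: D =",
      round(D(rng.binomial(2, pop_freq) / 2, case_freq, pop_freq), 1))
# The member's statistic stands several null standard deviations above
# the unrelated person's, despite a per-SNP bias of only ~1/n_cases.
```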
Subsequent studies extended the range of vulnerabilities for summary statistics. One line of studies improved the test statistic in the original work and analysed its mathematical properties64–66. Under the assumption of common SNPs in linkage equilibrium, the improved test statistic is mathematically guaranteed to yield maximal power for any specificity level (Box 2). Another group went beyond allele frequencies and demonstrated that it is possible to exploit local linkage disequilibrium (LD) structures for ADAD67. The power of this approach stems from searching for the co-occurrence of two relatively uncommon alleles in different haplotype blocks that together create a rare event. Another study developed a method to exploit the effect sizes of GWAS involving quantitative traits to detect the presence of the target68. A powerful extension of this study exploits GWAS that utilize the same cohort for multiple phenotypes: the adversary repeats the identification process of the target with the effect sizes of each phenotype and integrates them to boost identification performance. After determining the presence of the target in a quantitative trait study, the adversary can further exploit the GWAS data to predict the phenotypes with high accuracy69.
The actual risk of ADAD has been the subject of intense debate. Following the original 2008 study63, the NIH and other data custodians moved their GWAS summary statistics from public databases to access-controlled databases such as dbGaP70. A retrospective analysis found that significantly fewer GWAS publicly released their summary statistics after the discovery of this attack71. As of now, most studies publish summary statistics on 10–500 SNPs, which is compatible with one suggested guideline to manage risk69. However, some researchers have warned that these policies are too harsh72. There are several practical complications that the adversary needs to overcome to launch a successful attack, such as obtaining the target’s DNA data73 and accurately matching the target’s ancestries to those in the reference database74. Failure to address any of these prerequisites can severely impair the performance of ADAD. In addition, for a range of GWAS, the associated attributes are not sensitive or private (for example, height); thus, even if ADAD occurs, the impact on the participant should be minimal. A recent NIH workshop proposed the release of summary statistics as the default policy, together with an exemption mechanism for studies with increased risk due to the sensitivity of the attribute or the vulnerability level of the summary data75.
The gene expression scenario
Databases such as the NIH’s Gene Expression Omnibus (GEO) publicly hold hundreds of thousands of human gene expression profiles that are linked to a range of medical attributes. A recent study proposed a potential route to exploit these profiles for ADAD76. The method starts with a training step that employs a standard expression quantitative trait loci (eQTL) analysis on a reference dataset. The goal of this step is to identify several hundred strong eQTLs and to learn the expression-level distributions for each genotype. Next, the algorithm scans the public expression profiles. For each eQTL, it uses a Bayesian approach to calculate the probability distribution of the genotypes given the expression data. Last, the algorithm matches the target’s genotypes with the inferred allelic distributions of each expression profile and tests the hypothesis that the match is random. If the null hypothesis is rejected, the algorithm links the identity of the target to the medical attribute in the gene expression experiment. This ADAD technique has the potential for relatively high accuracy under ideal conditions: based on large-scale simulations, the authors predicted that the method can reach a type I error of 1×10−5 with a power of 85% when tested on an expression database of the entire US population.
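The linking step can be sketched as follows, under strong simplifying assumptions: unit-variance expression noise, a single invented per-allele effect size shared by all eQTLs, and a simple log-ratio match score. The published method76 is considerably more elaborate; everything numeric below is simulated.

```python
import numpy as np

rng = np.random.default_rng(2)
n_eqtl = 200
maf = rng.uniform(0.2, 0.5, n_eqtl)
prior = np.stack([(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2])  # P(g)

beta = 1.0                        # assumed per-allele expression shift
true_g = rng.binomial(2, maf)     # genotypes behind the public profile
expr = true_g * beta + rng.normal(0, 1, n_eqtl)   # observed expression

def gauss(x, mu):                 # N(mu, 1) density
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

likelihood = np.stack([gauss(expr, g * beta) for g in range(3)])
posterior = likelihood * prior
posterior /= posterior.sum(axis=0)          # P(genotype | expression)

def match_score(genotypes):
    """Sum of log posterior/prior ratios: positive when the candidate's
    genotypes explain the expression profile better than chance."""
    idx = np.arange(n_eqtl)
    return float(np.log(posterior[genotypes, idx]
                        / prior[genotypes, idx]).sum())

print("profile donor:", round(match_score(true_g), 1))                # >> 0
print("random person:", round(match_score(rng.binomial(2, maf)), 1))  # < 0
```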
There are several practical limitations to ADAD via expression data. While the training and inference steps are capable of working with expression profiles from different tissues, the method reaches its maximal power when the training and inference utilize eQTL from the same tissue. Additionally, there is a substantial loss of accuracy when the expression data in the training phase is collected using a different technology than the expression data in the inference phase. Another complication is that in order to fully execute the technique on a large database such as GEO, the adversary will need to manage and process substantial amounts of expression data. Due to the technical complexities, the NIH did not issue any changes to their policies regarding sharing expression data from human subjects.
Completion attacks
Completion of genetic information from partial data is a well-studied task in genetics, called genotype imputation77. This method takes advantage of the linkage disequilibrium between markers and uses reference panels with complete genetic information to restore missing genotype values in the data of interest. The very same strategies enable an adversary to expose regions of interest when only partial access to the DNA data is available. In a famous example of a completion attack, a study showed that it is possible to infer James Watson’s predisposition to Alzheimer’s disease from the ApoE locus despite the masking of this gene78. As a result of the study, a 2-Mb segment around the ApoE gene was removed from Watson’s published genome.
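The logic of such a completion attack can be shown with a single tag SNP, although practical imputation uses genome-wide reference panels and hidden Markov models. The haplotype counts below are fabricated; only the principle, that public LD information lets the adversary fill in a deliberately masked locus, carries over.

```python
from collections import Counter

# Fabricated reference panel of (tag SNP allele, masked-locus allele)
# haplotypes, mimicking strong LD between a tag SNP and a masked locus.
reference_haplotypes = (
    [("A", "risk")] * 70 + [("A", "benign")] * 10 +
    [("G", "benign")] * 110 + [("G", "protective")] * 10
)

def impute_masked(tag_allele):
    """Predict the masked allele from the panel's conditional counts."""
    matches = Counter(m for t, m in reference_haplotypes
                      if t == tag_allele)
    allele, count = matches.most_common(1)[0]
    return allele, count / sum(matches.values())

print(impute_masked("A"))   # ('risk', 0.875): the mask is largely undone
```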
In some cases, completion techniques also enable the prediction of genomic information when there is no access to the DNA of the target. This is possible when genealogical information is available in addition to genetic data. In the basic setting, the adversary obtains access to a single genetic dataset of a known individual and exploits this information to estimate genetic predispositions of relatives whose genetic information is inaccessible. A recent study demonstrated the feasibility of this attack by taking advantage of self-identified genetic datasets from OpenSNP.org, an internet platform for public sharing of genetic information79. Using Facebook searches, the research team was able to find relatives of the individuals who self-identified their genetic datasets. Next, the team predicted the genotypes of these relatives and estimated their genetic predisposition to Alzheimer’s disease using a Bayesian approach.
In the advanced setting, the adversary has access to the genealogical and genetic information of multiple relatives of the target80. The algorithm finds relatives of the target who donated their DNA to the reference panel and who reside on a unique genealogical path that includes the target, for example, a pair of half-first cousins whose shared grandfather is the target. A shared DNA segment between the relatives indicates that the target carries the same segment. By scanning more pairs of relatives that are connected through the target, it is possible to infer both copies of autosomal loci and collect more genomic information on the target without any access to his or her DNA. This approach is more accurate than the basic setting and enables the inference of genotypes of more distant relatives. In Iceland, deCODE Genetics leveraged its large reference panel and genealogical information to infer genetic variants of an additional 200,000 living individuals who never donated their DNA81. In May 2013, Iceland’s Data Protection Authority prohibited the use of this technique until consent is obtained from the individuals who are not part of the original reference panel.
Mitigation techniques
Most of the genetic privacy breaches presented above require a background in genetics and statistics and – importantly – a motivated adversary. One school of thought posits that these practical complexities markedly diminish the probability of an adverse event82,83. In this view, an appropriate mitigation strategy is to simply remove obvious identifiers from the datasets before publicly sharing the information. In the field of computer security, this risk management strategy is called security by obscurity. The opponents of security by obscurity posit that risk management schemes based on the probability of an adverse event are fragile and short lasting. Technologies only get better with time and what is technically challenging but possible today will be much easier in the future. In other words, it is impossible to estimate future risks of adverse events84. Known in cryptography as Shannon’s maxim85, this school of thought assumes that the adversary exists and is equipped with the knowledge and means to execute the breach. Robust data protection, therefore, is achieved by explicit design of the data access protocol rather than by relying on the small chances of a breach86.
Access control
Privacy risks are both amplified and more uncertain when data is shared publicly with no record of who accesses it. An alternative is to place sensitive data in a secure location and to screen the legitimacy of applicants and their research projects through specialized committees. Once approval is granted, the applicants are allowed to download the data under the conditions that they store it in a secure location and do not attempt to identify individuals. In addition, the applicants are required to file periodic reports about the data usage and any adverse events. This approach is the cornerstone of dbGaP62,87. Based on periodic reports by users, a retrospective analysis of dbGaP access control identified eight data-management incidents among close to 750 studies, mostly involving non-adherence to the technical regulations, with no reports of breaches of participants’ privacy88.
Despite the absence of privacy breaches thus far, some have criticized the lack of real oversight once the data is in the hands of the applicant89. An alternative model uses a trust-but-verify approach: users cannot download the data itself; instead, based on their privileges, they may execute certain types of queries, which are recorded and audited by the system90,91. Supporters of this model state that monitoring has the potential to deter malicious users and to facilitate early detection of adverse events. One technological challenge is that audit systems usually rely on anomalous behaviour to detect adversaries92, and it has yet to be proven that such methods can reliably distinguish between legitimate and malicious use of genetic data. Auditing also requires that any interaction with the genetic datasets is done through a standard set of API calls that can be analysed. By contrast, most genomic formats currently operate through more liberal text-parsing approaches, although several community efforts have been made to standardize genomic analysis93,94.
Another model of access control is to allow the original participants to grant access to their data instead of delegating this responsibility to a data access committee95,96. This model centres on dynamic consent based on ongoing communication between researchers and participants regarding data access. Supporters of this model state that this approach streamlines the consent process, enables participants to modify their preferences throughout their lifetimes, and can promote greater transparency, higher levels of participant engagement, and oversight. An example of such an effort is PEER (Platform for Engaging Everyone Responsibly). In this setting, Private Access Inc. operates a service that manages the access rights and mediates the communication between researchers and participants without revealing the identity of the participants. A trusted agent, Genetic Alliance, holds the participants’ health data, offers stewardship regarding privacy preferences, and grants access to data based on participants’ decisions. Participant-based access control is still a relatively new method. As data custodians gain more experience with such frameworks, a better picture will emerge regarding their utility as an alternative to traditional access-control methodologies for risk-benefit management.
Data anonymization
The premise of anonymity is the ability to be ‘lost in the crowd’. One line of studies suggested restoring anonymity by restricting the granularity of quasi-identifiers to the point that no record in the database has a unique combination of quasi-identifiers. One heuristic is k-anonymity97, in which attribute values are generalized or suppressed such that, for each record, there are at least k−1 other records with the same combination of quasi-identifiers. To maximize the utility of the data for subsequent analysis, the generalization process is adaptive: certain records receive lower resolution depending on the distribution of the other records, and data categories that remain too identifying are suppressed entirely. There is a strong trade-off in the selection of the value of k; higher values better protect privacy but at the same time reduce the utility of the data. As a rule of thumb, k=5 is commonly used in practice98. Most k-anonymity work centres on protecting demographic identifiers. For genetic data, one study suggested a 2-anonymity protocol that generalizes the four nucleotides in DNA sequences into broader biochemical groups, such as pyrimidines and purines99. However, the utility of such data for broad genetic applications is unclear. Furthermore, k-anonymity is vulnerable to attribute disclosure attacks when the adversary has prior knowledge about the presence of the target in the database100,101. Thus, while this heuristic is easy to comprehend, its privacy properties as well as its relevance to genomic studies are in question.
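A minimal sketch of the heuristic is shown below, using a common textbook generalization hierarchy (exact age to decade, five-digit ZIP code to its three-digit prefix); the records and the choice of k are illustrative.

```python
from collections import Counter

records = [
    {"age": 34, "zip": "02139", "variant": "BRCA1 c.68_69delAG"},
    {"age": 36, "zip": "02142", "variant": "none"},
    {"age": 38, "zip": "02139", "variant": "none"},
    {"age": 71, "zip": "10036", "variant": "APOE e4/e4"},
]

def generalize(rec):
    """Coarsen quasi-identifiers: exact age -> decade, ZIP5 -> ZIP3."""
    return (rec["age"] // 10 * 10, rec["zip"][:3])

def k_anonymize(records, k=2):
    counts = Counter(generalize(r) for r in records)
    released = []
    for r in records:
        qid = generalize(r)
        if counts[qid] >= k:       # class is large enough: release
            released.append({"quasi_id": qid, "variant": r["variant"]})
        # classes smaller than k are suppressed entirely
    return released

for row in k_anonymize(records):
    print(row)
# The three thirty-somethings in ZIP 021xx form a class of size 3 and
# are released; the lone record in ZIP 100xx is suppressed.
```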
Differential privacy is an emerging methodology for privacy-preserving reporting of results, primarily of summary statistics102 (Box 3). In contrast to k-anonymity, this method guarantees privacy against an adversary with arbitrary prior knowledge. Differential privacy operates by adding noise to the results before their release. The algorithm tunes the amount of noise such that the reported results will be statistically indistinguishable from similar reported results that would have been obtained if a single record had been removed from the original dataset. This way, an adversary with any type of prior knowledge can never be sure whether a specific individual was part of the original dataset because the data release process produces results that are almost exactly the same if the individual was not included. Due to its theoretical guarantees and tractable computational demands, differential privacy has become a vibrant research area in computer science and statistics. In perhaps the best-known large-scale implementation, the US Census Bureau utilizes this technique for privacy-preserving release of data in the online OnTheMap tool103.
In the context of genetic privacy, several studies have explored differentially private release of common summary statistics of GWAS data, such as the allele frequencies of cases and controls, χ2 statistics and p-values104,105, or differentially private shifting of the original locations of variants106. Currently, these techniques require a large amount of noise even for the release of GWAS statistics on a small number of SNPs, which renders these measures impractical. It is unclear whether there is a perturbation mechanism that can add much smaller amounts of noise to GWAS results while satisfying the differential privacy requirement, or whether perturbation can be shown to be effective for privacy preservation under a different theoretical model.
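For concreteness, the sketch below applies the Laplace mechanism to case-group allele frequencies. Removing one of n individuals changes a frequency by at most 1/n (two alleles out of 2n), and the privacy budget ε is split evenly across the m released SNPs; the counts and ε are invented, and this generic mechanism stands in for, rather than reproduces, the published procedures104,105.

```python
import numpy as np

rng = np.random.default_rng(3)

def private_allele_freqs(allele_counts, n_individuals, epsilon):
    """epsilon-differentially private allele frequencies via the
    Laplace mechanism, with the budget split across all SNPs."""
    m = len(allele_counts)
    freqs = allele_counts / (2 * n_individuals)
    sensitivity = 1.0 / n_individuals      # max change from one person
    scale = sensitivity * m / epsilon      # per-SNP noise scale
    return np.clip(freqs + rng.laplace(0, scale, m), 0, 1)

counts = np.array([120, 850, 400])         # toy counts from 1,000 cases
print(private_allele_freqs(counts, 1_000, epsilon=1.0))
# With m = 3 the noise is mild; with the ~10^5-10^6 SNPs of a real GWAS
# the noise scale grows proportionally and swamps the signal, which is
# the impracticality noted above.
```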
Cryptographic solutions
Modern cryptography has brought new advances to data dissemination beyond the traditional use of encrypting sensitive information and distributing the decryption key to authorized users. These solutions enable well-defined uses of the data while blocking unauthorized operations. Unlike the solutions in the previous section, the underlying data is not perturbed for authorized uses.
One line of cryptographic work considers the problem of privacy-preserving approaches to outsource computation on genetic information to third parties. For example, with the advent of ubiquitous genetic data, patients (or their physicians) will interact throughout their lives with a variety of online genetic interpretation applications, such as promethease.com, increasing the chance of a privacy breach. Recent cryptographic work has suggested homomorphic encryption (Box 4) for secure genetic interpretation107. In this method, users send encrypted versions of their genomes to the cloud. The interpretation service can access the cloud data but does not have the key and therefore cannot read the plain genotype values. Instead, the interpretation service executes the risk prediction algorithm on the encrypted genotypes. Due to the special mathematical properties of the underlying cryptosystem, the user simply decrypts the results given by the interpretation service to obtain the risk prediction. This way, the user does not expose genotypes or disease susceptibility to the service provider and interpretation companies can offer their service to users concerned with privacy. Preliminary results have highlighted the potential feasibility of this scheme108. A proof-of-concept study encrypted the variants of a 1000 Genomes individual and simulated a secure inference of heart disease risk based on 23 SNPs and 17 environmental factors. The total size of the encrypted genome was 51 Gbyte and the risk calculation took 6 minutes on a standard computer. The current scope of risk prediction models is still narrow but this approach might be quite amenable to future improvements.
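The homomorphic principle can be illustrated with the additively homomorphic Paillier cryptosystem, which supports exactly the operations a linear risk score needs: adding encrypted values and multiplying them by public constants. Whether the cited scheme107 uses this particular cryptosystem is not stated here; the key size, weights and dosages below are toy values for illustration only.

```python
import math, random

# Toy Paillier keypair (tiny primes for readability; real deployments
# use moduli of 2,048 bits or more).
p, q = 101, 113
n, n2, g = p * q, (p * q) ** 2, p * q + 1
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)       # lcm(p-1, q-1)
mu = pow((pow(g, lam, n2) - 1) // n, -1, n)

def encrypt(m):
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(2, n)
    return pow(g, m, n2) * pow(r, n, n2) % n2

def decrypt(c):
    return (pow(c, lam, n2) - 1) // n * mu % n

# The user uploads encrypted risk-allele dosages; the service combines
# them with public per-SNP weights without ever decrypting anything:
# Enc(a) * Enc(b) = Enc(a + b) and Enc(a)^w = Enc(w * a)  (mod n^2).
dosages = [0, 1, 2, 1]            # user's genotypes (toy values)
weights = [3, 1, 2, 5]            # per-SNP risk weights (toy values)
encrypted = [encrypt(d) for d in dosages]

score_ct = 1
for c, w in zip(encrypted, weights):
    score_ct = score_ct * pow(c, w, n2) % n2

assert decrypt(score_ct) == sum(w * d for w, d in zip(weights, dosages))
print("risk score decrypted by the user:", decrypt(score_ct))   # 10
```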
Cryptographic studies have also considered the task of outsourcing read mapping without revealing any genetic information to the service provider109–111. The basis of some of these protocols is secure multiparty computation (SMC). SMC allows two or more entities, each holding some private data, to execute a computation on these private inputs without revealing the inputs to each other or disclosing them to a third party. In one classic example of SMC, two individuals can determine who is richer without either one revealing their actual wealth to the other112. Earlier studies suggested SMC versions of edit-distance-based mapping of DNA sequences that do not reveal their content109,110. However, regular (insecure) edit-distance-based mapping is too slow to handle the volumes of high-throughput sequencing reads, narrowing the applications for the much slower secure version. A more recent study proposed a privacy-preserving version of the popular seed-and-extend algorithm111, which serves as the basis of several high-throughput alignment tools111,113. The privacy-preserving version is a hybrid: the seeding part is securely outsourced to a cloud, where cryptographic hashing hides the actual DNA sequences while permitting string matching, and the cloud results are streamed to a local trusted computer that performs the extension part. By tuning the underlying parameters of the seed-and-extend algorithm, this method puts most of the computational burden on the cloud. Experiments with real sequencing data showed that the cloud performs >95% of the computational effort. In addition, the secure algorithm takes only 3.5× longer than a similar insecure implementation, suggesting a tractable price tag for maintaining privacy.
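The seeding half of such a hybrid can be illustrated with keyed hashing: the trusted client tokenizes k-mers with a secret key, so the cloud matches opaque tokens without seeing nucleotides, and candidate positions are returned for local extension. The seed length, key handling and index structure below are invented simplifications of the protocol described in ref. 111.

```python
import hashlib, hmac

KEY = b"client-secret-key"   # stays on the trusted client
K = 4                        # toy seed length; real seeds are longer

def token(kmer: str) -> str:
    """Keyed hash of a k-mer: matchable by the cloud, but unreadable."""
    return hmac.new(KEY, kmer.encode(), hashlib.sha256).hexdigest()

# Client side: index the reference once and upload only tokens.
reference = "ACGTACGGTACT"
cloud_index = {token(reference[i:i + K]): i
               for i in range(len(reference) - K + 1)}

# Cloud side: look up the read's seed tokens; no sequence is exposed.
read = "ACGGTA"
seed_hits = [(i, cloud_index.get(token(read[i:i + K])))
             for i in range(len(read) - K + 1)]
print(seed_hits)   # [(0, 4), (1, 5), (2, 6)] -> extend locally from 4
```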
Beyond outsourcing of computation, several studies designed cryptographically secure algorithms for searching genetic databases. One study suggested searchable genetic databases for forensic purposes that allow only going from genetic data to identity but not from identity to genetic data114. The forensic database stores the individuals’ names and contact information in an encrypted form. The key for each entry is the corresponding individual’s genotypes. This way, knowing genotype information (for example, from a crime scene) can reveal the identity but not the opposite. In addition, to tolerate genotyping errors or missing data, the study suggested a fuzzy encryption scheme in which a decryption key can approximately match the original key. Another cryptographic protocol proposed matching genetic profiles between two parties for paternity tests or carrier screening without exposing the actual genetic data115,116. A smartphone-based implementation was presented for one version of this algorithm117. A recent study suggested a scalable approach for finding relatives using genome-wide data without disclosing the raw genotypes to a third party or other participants118. First, users collectively decide the minimal degree of relatedness they wish to accept. Next, each user posts a secure version of her genome to a public repository using a fuzzy encryption scheme. Then, users compare their own secure genome to the secure genomes of other users. Comparison of two encrypted genomes reveals no information if the genomes are farther than the threshold degree of relatedness; otherwise, it reveals the exact genetic distance. An evaluation of the efficacy of this approach via experiments with hundreds of individuals from the 1000 Genomes Project showed that even second-degree relatives can reliably find each other118.
A major open question is whether cryptographic protocols can facilitate data sharing for research purposes. So far, cryptographic schemes have focused on developing protocols for GWAS analysis without the need to reveal individual-level genetic data. One study presented a scheme in which genetic data and the computation of GWAS contingency tables are securely outsourced via homomorphic encryption to external data centres119. A trusted party (for example, the NIH) acts as a gateway that accepts requests from researchers in the community, instructs the data centres to perform computations on the encrypted data, and decrypts and disseminates the GWAS results back to the researchers. A more recent study tested a scheme that generates GWAS summary statistics without a trusted party, using only SMC between the data centres120. Another study evaluated the outsourcing of GWAS analysis to commercially available tamper-resistant hardware121. Unlike the schemes above119,120, the individual-level genotypes are decrypted as part of the GWAS summary-statistics computation, but the exposure occurs for a short amount of time in a secure hardware environment, which prevents leakage. All of the cryptographic GWAS schemes above suffer from one common drawback: the protocols produce summary statistics, which are theoretically amenable to ADAD methods. As of today, cryptography has yet to devise a comprehensive data-sharing solution for GWAS.
Conclusions
In the last few years, a torrent of studies has suggested that a motivated, technically sophisticated adversary is capable of exploiting a wide range of genetic data. With the constant innovation in genetics and the explosion of online information, we can expect that new privacy breaching techniques will be discovered in the next few years and that technical barriers to conducting existing attacks will diminish. On the other hand, privacy-preserving strategies for data dissemination are a vibrant area of research. Rapid progress has been made, and powerful frameworks such as differential privacy and homomorphic encryption are now part of the mitigation arsenal. At least for certain tasks in genetics, there are protocols that preserve the privacy of individuals. However, protecting privacy is only one facet of the solution. Lessons from computer security have highlighted that usability is a key component for the wide adoption of secure protocols. Successful implementations should hide unnecessary technical details from the users, minimize the computational overhead, and enable legitimate research122,123. We have yet to fully achieve this aim.
In addition, successfully balancing privacy demands and data sharing is not restricted to technical means124. A balanced informed consent process outlining both benefits and risks is a key ingredient for maintaining long-lasting credibility in genetic research. With the active engagement of a wide range of stakeholders from the broad genetics community and the general public, we as a society can facilitate the development of social and ethical norms, legal frameworks, and educational programs to reduce the chance of misuse of genetic data regardless of the ability to identify datasets.
Acknowledgments
YE is an Andria and Paul Heafy Family Fellow and holds a Career Award at the Scientific Interface from the Burroughs Wellcome Fund. This study was supported by a National Human Genome Research Institute grant R21HG006167 and by a gift from Cathy and Jim Stone. The authors thank Dina Zielinski and Melissa Gymrek for useful comments.
Glossary
| Term | Definition |
|---|---|
| Safe Harbor | A standard in the HIPAA Privacy Rule for de-identification of protected health information by removing 18 types of quasi-identifiers |
| Haplotype | A set of alleles along the same chromosome |
| Cryptographic hashing | A procedure that yields a fixed-length output from an input of any size, in a way that makes it hard to determine the input from the output |
| Dictionary attack | A brute-force approach to reversing cryptographic hashing by scanning the relatively small space of possible inputs |
| Alice and Bob | Common generic names in computer security for party A and party B |
| Type I error | The probability of obtaining a positive answer for a negative item |
| Linkage equilibrium | Absence of correlation between the alleles at two loci |
| Power | The probability of obtaining a positive answer for a positive item |
| Specificity | The probability of obtaining a negative answer for a negative item |
| Effect size | The contribution of an allele to the value of a trait |
| Positive predictive value | The probability that a positive answer is a true positive |
| Expression quantitative trait loci (eQTLs) | Genetic variants associated with variability in gene expression |
| Genotype imputation | A class of statistical techniques for predicting a genotype from information on surrounding genotypes |
| Linkage disequilibrium (LD) | The correlation between alleles at two loci |
| API | A set of commands that specify the interface to a dataset |
| χ2 statistic | A measure of association in case-control GWAS |
| Read mapping | A computationally intensive step in high-throughput sequencing that finds the genomic locations of DNA strings |
| Edit distance | The total number of insertions, deletions and substitutions between two strings |
Biographies
Yaniv Erlich is a Fellow at the Whitehead Institute for Biomedical Research. Erlich received his Ph.D. from Cold Spring Harbor Laboratory in 2010 and his B.Sc. from Tel-Aviv University in 2006. Prior to that, Erlich worked in computer security, conducting penetration tests on financial institutions and commercial companies. Dr. Erlich’s research involves developing new algorithms for computational human genetics.
Arvind Narayanan is an Assistant Professor in the Department of Computer Science and the Center for Information Technology and Policy at Princeton. He studies information privacy and security. His research has shown that data anonymization is broken in fundamental ways, for which he jointly received the 2008 Privacy Enhancing Technologies Award. His current research interests include building a platform for privacy-preserving data sharing.
Footnotes
Competing interests statement
None.