Bioinformatics. 2015 Jun 1;31(11):1701-7.
doi: 10.1093/bioinformatics/btv018. Epub 2015 Jan 27.

Deterministic identification of specific individuals from GWAS results

Ruichu Cai et al. Bioinformatics.

Abstract

Motivation: Genome-wide association studies (GWASs) are commonly applied to human genomic data to understand the causal gene combinations statistically connected to certain diseases. Patients involved in these GWASs could be re-identified when the studies release statistical information on a large number of single-nucleotide polymorphisms. Subsequent work, however, found that such privacy attacks are theoretically possible but unsuccessful and unconvincing in real settings.

Results: We derive the first practical privacy attack that can successfully identify specific individuals from limited published associations from the Wellcome Trust Case Control Consortium (WTCCC) dataset. For GWAS results computed over 25 randomly selected loci, our algorithm always pinpoints at least one patient from the WTCCC dataset. Moreover, the number of re-identified patients grows rapidly with the number of published genotypes. Finally, we discuss prevention methods to disable the attack, thus providing a solution for enhancing patient privacy.

Availability and implementation: Proofs of the theorems and additional experimental results are available in the online supplementary documents. The code for the attack algorithm is publicly available at https://sites.google.com/site/zhangzhenjie/GWAS_attack.zip. The genomic dataset used in the experiments is available at http://www.wtccc.org.uk/ on request.

Figures

Fig. 1.
Example of GWAS result publication and the privacy attack. Top left: part of the raw data of the GWAS, which contains genome sequences for study participants. Bottom: published results of the GWAS, which list the genotypes of interest, their frequencies and correlation with the disease, as well as the correlation between each pair of these genotypes. Right column: the proposed privacy attack, which first recovers a co-occurrence matrix from the published statistics (only M11 is shown, for space reasons) and uses this matrix to build presence proofs, i.e. sets of genotypes that must be present among the cases
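The first step in the caption above, recovering a co-occurrence count from published frequencies and pairwise correlations, might look roughly like the following sketch. It assumes each genotype is a 0/1 indicator over the cases and that the published pairwise statistic is the Pearson (phi) correlation of those indicators. The abstract does not state which statistic the paper uses, and the function name and parameters are hypothetical.

```python
import math


def cooccurrence_count(n_cases, freq_a, freq_b, corr_ab):
    """Estimate how many cases carry both genotype a and genotype b.

    Assumption: corr_ab is the Pearson (phi) correlation between two
    0/1 presence indicators, so that
        corr_ab = (p_ab - p_a * p_b) / sqrt(p_a (1 - p_a) p_b (1 - p_b)).
    Solving for the joint frequency p_ab and scaling by the number of
    cases gives the co-occurrence count.
    """
    spread = math.sqrt(freq_a * (1 - freq_a) * freq_b * (1 - freq_b))
    joint_freq = freq_a * freq_b + corr_ab * spread
    # Counts are integers; rounding absorbs floating-point error.
    return round(n_cases * joint_freq)
```

When the statistics are published at full precision, the rounded value equals the true count. With coarser precision, several integer counts may be consistent with the published numbers.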
Fig. 2.
Presence proofs (length-2 and longer) and their generation. The attack also infers the frequency of each proof. When the frequency cannot be uniquely determined, the attack derives an upper bound and a lower bound for the frequency (as shown for the rightmost proof of length 3). A proof of length l is generated by combining two proofs of length l-1 that differ in exactly one genotype
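The generation rule in the caption above (two proofs of length l-1 that differ in exactly one genotype combine into a proof of length l) can be sketched as below. This is an illustration under assumptions, not the authors' released code: the function names are hypothetical, genotypes are represented as plain strings, and the bounds shown are the generic inclusion-exclusion (Fréchet) bounds, which may be looser than those the paper derives.

```python
from itertools import combinations


def join_proofs(proofs):
    """Build length-l candidates from a collection of length-(l-1) proofs.

    Two proofs are joined only when they have the same length and differ
    in exactly one genotype, i.e. their union has one more element.
    """
    proofs = {frozenset(p) for p in proofs}
    candidates = set()
    for p, q in combinations(proofs, 2):
        union = p | q
        if len(p) == len(q) and len(union) == len(p) + 1:
            candidates.add(union)
    return candidates


def frequency_bounds(n_cases, count_p, count_q):
    """Generic bounds on how many cases satisfy both proofs p and q.

    Assumption: only the individual counts are known. At least
    count_p + count_q - n_cases cases must satisfy both, and at most
    min(count_p, count_q) can.
    """
    lower = max(0, count_p + count_q - n_cases)
    upper = min(count_p, count_q)
    return lower, upper
```

For example, joining `{"g1", "g2"}` with `{"g1", "g3"}` yields the candidate `{"g1", "g2", "g3"}`, while `{"g4", "g5"}` joins with neither because it shares no genotype with them. When the lower and upper bounds coincide, the frequency is uniquely determined.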
Fig. 3.
The number of re-identified cases in the seven WTCCC datasets, averaged across 10 trials with randomly selected sets of published genotypes. The asterisks show the average number of correct re-identifications. The boxes show the median, 25% quantile, 75% quantile, maximum and minimum numbers of correct re-identifications. Overall, the attack correctly re-identifies at least 10 cases with more than 75% probability, and on average re-identifies 15 cases, which is approximately 1% of all cases. No incorrect re-identifications occurred. (a) Results on the seven datasets, with default parameter values listed in Table 2. (b) Results with different precisions of the published statistics on the HT dataset, with other parameters fixed to their default values. (c) Results when varying the number of published genotypes on the HT dataset, with other parameters fixed to default values. (d) Results with varying numbers of genotypes used in the attack on the HT dataset, with other parameters fixed to default values. The Supplementary Document contains additional experimental results
Fig. 4.
The number of re-identified cases from the T2D dataset, based on the 36 SNPs published in Fraser et al. (2005) that are also available in the WTCCC dataset. The attack re-identifies a dozen cases on average, which is slightly fewer than when the published data is for 75 randomly selected genotypes. The number of re-identified cases gradually grows when more genotypes are used in the attack. The Supplementary Document contains additional results obtained by running the same experiment on other datasets of WTCCC

References

    1. Agrawal R., et al. (1994) Fast algorithms for mining association rules. In: Bocca J.B., et al. (eds) Proceedings of the 20th International Conference on Very Large Data Bases, VLDB, Springer, New York, NY, USA; Vol. 1215, pp. 487–499.
    2. Fraser J.A., et al. (2005) Same-sex mating and the origin of the Vancouver Island Cryptococcus gattii outbreak. Nature, 437, 1360–1364.
    3. Haines J.L., Pericak-Vance M.A. (2006) Genetic Analysis of Complex Disease. Hoboken, New Jersey, USA, John Wiley & Sons.
    4. Hinney A., et al. (2007) Genome wide association study for early onset extreme obesity supports the role of fat mass and obesity associated gene variants. PLoS One, 2, e1361.
    5. Homer N., et al. (2008) Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genet., 4, e1000167.
