J Comput Biol. 2021 May;28(5):435-451.
doi: 10.1089/cmb.2020.0445. Epub 2021 Jan 5.

Reconstructing Genotypes in Private Genomic Databases from Genetic Risk Scores

Brooks Paige et al. J Comput Biol. 2021 May.

Abstract

Some organizations, such as 23andMe and the UK Biobank, have large genomic databases that they reuse for multiple genome-wide association studies (GWAS). Even research studies that compile smaller genomic databases often use them to investigate many related traits. It is common for a study to report a genetic risk score (GRS) model for each trait within the publication. Here, we show that under some circumstances, these GRS models can be used to recover the genetic variants of individuals in these genomic databases, which constitutes a reconstruction attack. In particular, if two GRS models are trained on largely overlapping sets of participants, it is often possible to determine the genotype of each individual who was used to train one GRS model but not the other. We demonstrate this theoretically and experimentally by analyzing the Cornell Dog Genome database. The accuracy of our reconstruction attack depends on how accurately we can estimate the rate of co-occurrence of pairs of single nucleotide polymorphisms (SNPs) within the private database, so if this aggregate information is ever released, it would drastically reduce the security of a private genomic database. Caution should be applied when using the same database for multiple analyses, especially when a small number of individuals are included in or excluded from one part of the study.
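The idea of recovering a genotype from two GRS models can be illustrated with a minimal sketch. It rests on assumptions the abstract does not state: the GRS weights are fit by ordinary least squares, SNPs are coded as binary presence/absence, and the attacker knows the co-occurrence matrix K = XᵀX of the first study exactly. Under those assumptions, the normal equations give K(β_{M+1} − β_M) = x0 · (y0 − x0ᵀβ_{M+1}), so the vector on the left is a scaled copy of the added individual's genotype x0. The function names (`solve`, `gram`, `fit_grs`, `reconstruct`) and the synthetic data are hypothetical and are not taken from the paper.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def gram(X):
    """K = X^T X: co-occurrence counts for every pair of SNPs."""
    p = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(p)]
            for i in range(p)]

def fit_grs(X, y):
    """Assumed GRS model: least-squares weights from (X^T X) beta = X^T y."""
    p = len(X[0])
    Xty = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(p)]
    return solve(gram(X), Xty)

def reconstruct(K, beta_m, beta_m1):
    """Recover a binary genotype from d = K (beta_{M+1} - beta_M).

    Each entry of d is either ~0 (SNP absent) or the residual r (SNP
    present), so thresholding at half the largest magnitude separates them.
    """
    p = len(K)
    d = [sum(K[i][j] * (beta_m1[j] - beta_m[j]) for j in range(p))
         for i in range(p)]
    threshold = max(abs(v) for v in d) / 2
    return [1 if abs(v) > threshold else 0 for v in d]

# Synthetic private database: M individuals, p binary SNPs, noisy trait.
rng = random.Random(0)
p, M = 6, 40
X = [[rng.randint(0, 1) for _ in range(p)] for _ in range(M)]
true_w = [rng.gauss(0, 1) for _ in range(p)]
y = [sum(w * g for w, g in zip(true_w, row)) + rng.gauss(0, 0.5) for row in X]

# One individual is added for the second study.
x0 = [1, 0, 1, 1, 0, 0]
y0 = sum(w * g for w, g in zip(true_w, x0)) + rng.gauss(0, 0.5)

beta_m = fit_grs(X, y)                  # published GRS, first study
beta_m1 = fit_grs(X + [x0], y + [y0])   # published GRS, second study
recovered = reconstruct(gram(X), beta_m, beta_m1)
print(recovered == x0)
```

When K is only estimated from a separate public database, the entries of d no longer fall cleanly into two values, which is why the accuracy of the attack depends on the quality of that estimate.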

Keywords: GWAS; genetic risk scores; genomic privacy; long-term privacy; reconstruction attack.

Conflict of interest statement

The authors declare they have no competing financial interests.

Figures

FIG. 1.
We investigate the case where two GWAS studies are performed on two datasets that mostly contain the same individuals. We reconstruct the genotype of those individuals added to the second study, using the GRS from each study and an estimate of SNP frequencies. GRS, genetic risk score; GWAS, genome-wide association studies; SNP, single nucleotide polymorphism.
FIG. 2.
(A) We have perfect accuracy in reconstructing the genotype when K is known (using 200 random SNPs to estimate average breed weight in the Cornell Dog Database). (B) We can reconstruct all the genotypes of multiple dogs that are added to the second study and (C) this works in practice by using the data from the Cornell Dog Database, as in (A).
FIG. 3.
Example values taken by the noisy vector $\hat{d}$, given the true value of the corresponding SNP in the genome. (Left) adding one new participant; (right) adding three new participants. These figures are analogous to those in Figure 2, although in the case where K is not known and instead estimated from an independent public database.
FIG. 4.
Accuracy at reconstruction of genomes $x_0$ using EM estimation and a noisy estimate $\hat{K}$, as compared with a natural baseline that always predicts the most common variant at each SNP locus. We use this as a baseline, because without any additional information about $\beta_M$ and $\beta_{M+1}$, the most accurate prediction of the dog's genotype would be to predict the most common variant at each locus. Here, we define accuracy as the proportion of SNPs that are correctly identified in the dog that was found in the second GWAS study, but not the first. Each distribution is constructed from 500 experimental test points, in which we (1) took 10 random splits of the full dog dataset, assigning dogs to either the public or private dataset; (2) for each split, we tested the reconstruction 50 times, each time adding a different randomly sampled dog to the second GWAS study. The private dataset always has 1000 individuals; the public test dataset is of increasing size, improving performance. EM, expectation–maximization.
FIG. 5.
Results of Figure 4 broken down by individual dogs. Here, each point represents a dog and we define atypicality as the proportion of uncommon variants that the dog has compared with the public database; for instance, if 51% or more of dogs in the public database have a G in a specific locus, but this dog has a T, then this would count toward the dog's atypicality. In other words, dogs further to the right are less and less similar to average dogs present in the public dataset (measured by percentage of different variants). In contrast to the most-common-variant baseline, our method generalizes well even to dogs that are highly dissimilar to those in the public dataset. Larger public databases (right) provide more accurate population estimates of $\hat{K}$, leading to more accurate reconstructions overall.
FIG. 6.
Results for running the stochastic EM algorithm when estimating SNPs for three additional dogs simultaneously. This experimental setup replicates the experiment for one additional dog, across 5 public/private/test dataset splits, with 20 different test sets of three additional dogs for each. (Left) Accuracy at predicting SNP presence relative to the “most common variant” baseline. On average, the SEM algorithm predicts the correct SNP 75.5% of the time, relative to 71.5% for the baseline. (Right) As in the one-dog example, we see relative improvement in the performance of our algorithm when considering more atypical dogs. SEM, stochastic EM.
FIG. 7.
Accuracy at reconstruction of the genome of one additional individual, using EM estimation and a noisy estimate $\hat{K}$, measured as the size of the initial private database increases. For very small private databases, accuracy is very high, as changes in entries of β are clearly attributable to the new individual. Beyond a certain threshold, overall accuracy is quite stable. Error bars show mean and two standard deviations.
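The most-common-variant baseline, the accuracy measure, and the atypicality measure described in the captions for Figures 4 and 5 can be sketched in a few lines. This is an illustrative reading of those definitions, not the authors' code: genotypes are represented as strings with one allele letter per locus, the function names are hypothetical, and the toy public database uses an odd number of dogs so that a majority variant always exists.

```python
from collections import Counter

def most_common_variants(public):
    """Majority variant at each locus across the public genotypes."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*public)]

def accuracy(predicted, truth):
    """Proportion of loci at which the predicted variant matches the truth."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)

def atypicality(genotype, public):
    """Proportion of loci where the genotype differs from the public majority."""
    majority = most_common_variants(public)
    return sum(g != m for g, m in zip(genotype, majority)) / len(genotype)

# Toy public database: three dogs, four loci (made-up data).
public = ["GATC", "GATT", "GCTC"]
dog = "TATC"  # hypothetical dog added to the second study

baseline_prediction = most_common_variants(public)
print(baseline_prediction)
print(accuracy(baseline_prediction, dog))
print(atypicality(dog, public))
```

With these definitions, the baseline's accuracy on a dog is exactly one minus that dog's atypicality, so the baseline necessarily degrades on atypical dogs; any method that does better on such dogs must be drawing on the GRS models rather than population frequencies alone.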
