Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2018 Nov 9;362(6415):690-694.
doi: 10.1126/science.aau4832. Epub 2018 Oct 11.

Identity inference of genomic data using long-range familial searches

Affiliations

Identity inference of genomic data using long-range familial searches

Yaniv Erlich et al. Science. .

Abstract

Consumer genomics databases have reached the scale of millions of individuals. Recently, law enforcement authorities have exploited some of these databases to identify suspects via distant familial relatives. Using genomic data of 1.28 million individuals tested with consumer genomics, we investigated the power of this technique. We project that about 60% of the searches for individuals of European descent will result in a third-cousin or closer match, which theoretically allows their identification using demographic identifiers. Moreover, the technique could implicate nearly any U.S. individual of European descent in the near future. We demonstrate that the technique can also identify research participants of a public sequencing project. On the basis of these results, we propose a potential mitigation strategy and policy implications for human subject research.

PubMed Disclaimer

Conflict of interest statement

Competing interests: Y.E. and T.S. adapted the code for the cryptographic signatures. Y.E. and T.S are MyHeritage employees. Y.E. is also a consultant of ArcBio. I.P. holds equity in 23andMe. S.C. is a paid consultant of MyHeritage. When multiple companies are mentioned in this manuscript, we listed them in a lexicographic order.

Figures

Fig. 1.
Fig. 1.. The performance of long-range familial searches for various database sizes.
(A) The probability of finding at least one relative for various IBD thresholds (top) with 1.28 million searches of DTC-tested individuals (red) and 30 random GEDmatch searches (gray). Light gray shading indicates the 95% CI for the GEDmatch estimates. The dashed line indicates the probability of a surname inference from Y chromosome data (17). The bottom panel shows the 95% CIs (circles) and average total IBD length (squares) for a first cousin once removed (1C1R) to a fourth cousin once removed (4C1R) (20). (B) A population-genetic theoretical model for the probability of finding relatives up to a certain type of cousinship as a function of the database coverage of the population. 1C to 4C indicate first to fourth cousins.
Fig. 2.
Fig. 2.. Tracing a person of interest from a distant match using demographic identifiers.
(A) The possible relatives of a match (green) in a database. Each square represents a potential degree of relatedness. The range corresponds to the 5th to 95th percentile of shared IBD in centimorgans from (16). Red indicates relatives that could fit a bona fide 3C match (~100 cM).The average number of relatives is indicated in the top-left corner of each square on the basis of a fertility rate of 2.5 children per couple. Only genealogical relationships that are within 100-cM range include the average number of relatives. Nie/Nep, Niece/Nephew; G, Great; G2, Great-great; G3,Great-great-great; A/U, Aunt/Uncle. (B) An example of the geographical dispersion of third cousins or second cousins once removed around the matched relative. Every circle indicates 100 km. (C and D) The distribution of the expected age differences between matches and their potential relatives with a genetic distance of third cousins.The main text reports a conservative scenario, in which the age estimator of the target is in the highest bin of each histogram (red arrow).The age distribution is shown at a 10-year resolution (C) and at a 1-year resolution (D). (E) The entire pipeline of using demographic identifiers along with a long-range familial match to identify a U.S. person (blue type indicates the average number of people after incorporating each piece of information.).
Fig. 3.
Fig. 3.. Tracing a 1000Genomes sample using a long-range familial search.
The CEU pedigree is shown in black. To respect the privacy of the family, we omitted the sample identifiers and the exact pedigree structure. A GEDmatch search of the person of interest (black circle) returned two males (squares with gray dots) with a total IBD sharing of 180 and 171 cM to the target, respectively, and 62 cM between themselves. Using public genealogical records, we identified the ancestral couple (asterisk) of the matches and the person of interest.

Comment in

References

    1. Khan R, Mittelman D, Genome Biol. 19, 120 (2018). - PMC - PubMed
    1. Larkin L, Autosomal DNA testing comparison chart, The DNA Geek; http://thednageek.com/dna-tests/.
    1. Nelson SC, Fullerton SM, J. Genet. Couns 27, 770–781 (2018). - PubMed
    1. Gusev A et al., Genome Res. 19, 318–326 (2009). - PMC - PubMed
    1. Huff CD et al., Genome Res. 21, 768–774 (2011). - PMC - PubMed

Publication types