Elife. 2020 Jun 8:9:e54507.
doi: 10.7554/eLife.54507.

Predicting geographic location from genetic variation with deep neural networks

C J Battey et al. Elife. 2020.

Abstract

Most organisms are more closely related to nearby than distant members of their species, creating spatial autocorrelations in genetic data. This allows us to predict the location of origin of a genetic sample by comparing it to a set of samples of known geographic origin. Here, we describe a deep learning method, which we call Locator, to accomplish this task faster and more accurately than existing approaches. In simulations, Locator infers sample location to within 4.1 generations of dispersal and runs at least an order of magnitude faster than a recent model-based approach. We leverage Locator's computational efficiency to predict locations separately in windows across the genome, which allows us to both quantify uncertainty and describe the mosaic ancestry and patterns of geographic mixing that characterize many populations. Applied to whole-genome sequence data from Plasmodium parasites, Anopheles mosquitoes, and global human populations, this approach yields median test errors of 16.9 km, 5.7 km, and 85 km, respectively.
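To make the "median test error in km" figure concrete, here is a minimal sketch of how such a number could be computed from true and predicted coordinates. It assumes locations are (longitude, latitude) pairs in degrees and uses the haversine great-circle distance with a mean Earth radius of 6371 km. Both choices are assumptions for illustration; the abstract does not state how distances were measured, and the function names are hypothetical.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two (lon, lat) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to guard against rounding slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def median_test_error_km(true_locs, predicted_locs):
    """Median distance between paired true and predicted locations."""
    errors = [haversine_km(*t, *p) for t, p in zip(true_locs, predicted_locs)]
    return median(errors)


# Made-up coordinates, for illustration only.
true_locs = [(0.0, 0.0), (10.0, 10.0), (20.0, 20.0)]
predicted_locs = [(0.0, 0.0), (10.0, 11.0), (20.0, 22.0)]
median_error = median_test_error_km(true_locs, predicted_locs)
```

The median is less sensitive than the mean to a few badly misplaced samples, which is one reason it is a natural summary for errors of this kind.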

Keywords: Anopheles; Plasmodium; evolutionary biology; genetics; genomics; human.

Conflict of interest statement

CB, PR, AK: No competing interests declared.

Figures

Figure 1. Conceptual schematic of our approach.
Regions of the genome reflect correlated sets of genealogical relationships (A), each of which represents a set of ancestors with varying spatial positions back in time. We extract genotypes from windows across the genome (B), and train a deep neural network to approximate the relationship between genotypes and locations using Euclidean distance as the loss function (C). We can then use the trained network to predict the location of new genotypes held out from the training routine (D).
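The loss described in panel C can be illustrated with a small sketch. It computes the mean Euclidean distance between predicted and true (x, y) locations in plain Python. This is only an illustration of the quantity being minimized, not the authors' training code, and the function name is hypothetical.

```python
import math


def mean_euclidean_loss(predicted, true):
    """Mean Euclidean distance between paired predicted and true (x, y) points."""
    if len(predicted) != len(true) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    distances = [math.dist(p, t) for p, t in zip(predicted, true)]
    return sum(distances) / len(distances)


# Two samples: one is off by a 3-4-5 triangle, one is predicted exactly.
loss = mean_euclidean_loss([(0.0, 0.0), (3.0, 4.0)], [(3.0, 4.0), (3.0, 4.0)])
```

Unlike a squared-error loss, this quantity is in the same units as the map, so a loss value can be read directly as an average distance.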
Figure 2. Validation error for Locator runs on simulations with varying dispersal rates.
Simulations were on a 50 × 50 landscape and error is expressed in map units. (A) True and predicted locations by population mean dispersal rate and number of SNPs. 450 randomly-sampled individuals were used for training. (B) Error for runs with 100,000 SNPs and varying numbers of training samples. (C) Error for runs with 450 training samples and varying number of SNPs. Plots with error in terms of generations of expected dispersal are shown in Figure 2—figure supplement 1.
Figure 2—figure supplement 1. Validation error for Locator runs on simulations with varying dispersal distance, expressed in generations of mean dispersal (test error divided by mean dispersal distance per generation).
(A) Error for runs with 100,000 SNPs and varying numbers of training samples. (B) Error for runs with 450 training samples and varying number of SNPs.
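The rescaling used in this caption (test error divided by mean dispersal distance per generation) is a single division. A minimal sketch follows, with a hypothetical function name and made-up input values.

```python
def error_in_generations(test_error, mean_dispersal_per_generation):
    """Express a spatial error as a number of generations of mean dispersal.

    Both arguments must be in the same distance units (e.g. map units).
    """
    if mean_dispersal_per_generation <= 0:
        raise ValueError("mean dispersal distance must be positive")
    return test_error / mean_dispersal_per_generation


# Illustrative values only: an error of 0.9 map units in a population
# dispersing 0.45 map units per generation on average.
generations = error_in_generations(0.9, 0.45)
```

Expressing error this way makes runs with different dispersal rates comparable, since the same error in map units means more in a population that disperses less.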
Figure 2—figure supplement 2. Example training and validation loss histories.
The first three epochs (with very high loss) were excluded from the plot to improve axis scaling.
Figure 3. Test error and run times for Locator and SPASIBA on simulated data with dispersal distance equal to 0.45 map units/generation (top; 450 randomly sampled training samples) and empirical data from the ag1000g phase one dataset (bottom; 612 training samples from 14 sampling localities).
Figure 3—figure supplement 1. Predicted (colored points) and true (black circles) locations for Locator and SPASIBA on the ag1000g dataset.
Number of SNPs per run is shown on the right. Both methods were run on randomly selected SNPs with minor allele count >2 from the first five million base pairs of chromosome 2L.
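The minor allele count filter mentioned here can be sketched as follows. The example assumes diploid samples with no missing data, and a genotype matrix stored as rows of samples and columns of SNPs, with each entry giving the number of alternate alleles (0, 1, or 2). The layout and function names are assumptions for illustration.

```python
def minor_allele_counts(genotypes):
    """Minor allele count per SNP for a samples x SNPs matrix of 0/1/2 values."""
    n_samples = len(genotypes)
    n_snps = len(genotypes[0])
    counts = []
    for j in range(n_snps):
        alt = sum(row[j] for row in genotypes)
        # The minor allele is whichever of the two alleles is rarer.
        counts.append(min(alt, 2 * n_samples - alt))
    return counts


def filter_by_mac(genotypes, threshold=2):
    """Keep only SNP columns whose minor allele count is strictly above threshold."""
    macs = minor_allele_counts(genotypes)
    keep = [j for j, mac in enumerate(macs) if mac > threshold]
    filtered = [[row[j] for j in keep] for row in genotypes]
    return filtered, keep


# Toy matrix: 4 samples, 3 SNPs. Only the middle SNP has a minor allele count above 2.
genotypes = [
    [0, 1, 2],
    [0, 1, 2],
    [1, 1, 2],
    [0, 0, 1],
]
filtered, kept_columns = filter_by_mac(genotypes)
```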
Figure 4. Predicted and true locations for eight individuals simulated in a population with mean per-generation dispersal 0.45 (roughly 1% of the landscape width).
Black points are predictions from 2Mbp windows, blue points are training sample locations, and the red point is the true location for each individual. Contours show the 95%, 50%, and 10% quantiles of a two-dimensional kernel density across all windows.
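Windowed prediction starts by grouping SNPs according to their position on a chromosome. The sketch below assumes fixed-size, non-overlapping windows that start at position 0, and it returns the SNP indices that fall in each window. The function name is hypothetical and the exact windowing scheme used for the figure may differ.

```python
from collections import defaultdict


def snps_by_window(positions, window_size):
    """Map window index -> list of SNP indices whose position falls in that window."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    windows = defaultdict(list)
    for index, position in enumerate(positions):
        windows[position // window_size].append(index)
    return dict(windows)


# Four SNP positions in base pairs, grouped into 2 Mbp windows.
positions = [100, 1_999_999, 2_000_000, 4_500_000]
windows = snps_by_window(positions, 2_000_000)
```

A separate model would then be trained on the genotype columns of each window, giving one predicted location per window per individual.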
Figure 4—figure supplement 1. Predicted and true locations for eight individuals simulated in a population with an expected dispersal rate of 0.63 map units/generation, using a set of 10,000 randomly sampled SNPs.
Here, we generate predictions (black points) from bootstrap samples of the complete genotype matrix (in contrast to using separate sets of SNPs extracted from windows as used for figures in the main text). This could be useful for low-density genotyping data from approaches like ddRADseq, or when users lack a reference genome for windowing. In this setting, we see that the distribution of predictions is much narrower than when fitting individual windows.
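A bootstrap sample of a genotype matrix can be sketched as below. The example assumes that resampling is done over SNP columns with replacement, keeping the number of SNPs fixed. The caption does not spell out the resampling unit, so treat that as an assumption. The function name is hypothetical.

```python
import random


def bootstrap_snp_columns(genotypes, rng):
    """Resample SNP columns with replacement from a samples x SNPs matrix."""
    n_snps = len(genotypes[0])
    columns = [rng.randrange(n_snps) for _ in range(n_snps)]
    return [[row[j] for j in columns] for row in genotypes]


# Toy matrix: 2 samples, 3 SNPs. A seeded generator keeps the draw reproducible.
genotypes = [
    [0, 1, 2],
    [2, 1, 0],
]
resampled = bootstrap_snp_columns(genotypes, random.Random(0))
```

Training one model per bootstrap replicate yields a cloud of predictions for each individual, which plays the same role as the per-window predictions when windowing is not possible.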
Figure 5. Top – Predicted locations for 153 Anopheles gambiae/coluzzii genomes from the AG1000G panel, using 612 training samples and a 2Mbp window size.
The geographic centroid of per-window predictions for each individual is shown in black points, and lines connect predicted to true locations. Sample localities are colored by the mean test error with size scaled to the number of training samples. Bottom – Uncertainty from predictions in 2Mbp windows. Contours show the 95%, 50%, and 10% quantiles of a two-dimensional kernel density across windows.
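The centroid of a set of per-window predictions can be sketched as the arithmetic mean of the coordinates. This treats coordinates as planar, which is an assumption. For predictions spread over a large part of the globe, a centroid computed on the sphere could differ. The function name is hypothetical.

```python
def centroid(points):
    """Arithmetic mean of a non-empty list of (x, y) points."""
    if not points:
        raise ValueError("need at least one point")
    xs, ys = zip(*points)
    return sum(xs) / len(xs), sum(ys) / len(ys)


# Three per-window predictions for one individual (made-up values).
window_predictions = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
center = centroid(window_predictions)
```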
Figure 5—figure supplement 1. Comparison of cross-validation performance on the ag1000g dataset using SNPs from chromosome 3R, under varying network architectures and numbers of SNPs.
Boxplots show the distribution of Euclidean distance between the true and predicted locations of validation samples across 10 replicate training runs. Network shapes are described on the horizontal axis as 'layers × width'. Although two-layer networks are typically the least accurate, no single architecture provides consistently better performance across datasets of different sizes. Missing networks required over 12 GB GPU RAM.
Figure 5—figure supplement 2. Performance on 10,000 SNPs from chromosome 2L in the ag1000g phase one dataset when all samples from localities in the true country are dropped from the training set.
Figure 5—figure supplement 3. Performance on 10,000 SNPs from chromosome 2L in the ag1000g phase one dataset when all samples from the true locality are dropped from the training set.
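The hold-out design in these two supplements can be sketched as a split that removes every sample sharing a group label (a locality or a country) with the test samples. The record layout, field names, and sample labels below are hypothetical.

```python
def split_holding_out(samples, key, value):
    """Split samples so that none with samples[key] == value remain in training."""
    train = [s for s in samples if s[key] != value]
    test = [s for s in samples if s[key] == value]
    return train, test


# Made-up sample records for illustration.
samples = [
    {"id": "s1", "locality": "A", "country": "X"},
    {"id": "s2", "locality": "A", "country": "X"},
    {"id": "s3", "locality": "B", "country": "X"},
    {"id": "s4", "locality": "C", "country": "Y"},
]

# Drop the true locality from training.
train_loc, test_loc = split_holding_out(samples, "locality", "A")
# Drop every locality in the true country from training.
train_country, test_country = split_holding_out(samples, "country", "X")
```

Holding out a whole country is the stricter test, because the nearest remaining training samples are then farther from the test samples.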
Figure 6. Top – Predicted locations for 881 Plasmodium falciparum from the Plasmodium falciparum Community Project (Pearson et al., 2019) (5% of samples for each collecting locality), using 5084 training samples and a 500Kbp window size.
The geographic centroid of per-window predictions for each individual is shown in black points, and lines connect predicted to true locations. Sample localities are colored by the mean test error with size scaled to the number of training samples. Bottom – Uncertainty from predictions in 500Kbp windows. Contours show the 95%, 50%, and 10% quantiles of a two-dimensional kernel density across windows.
Figure 6—figure supplement 1. Centroid prediction error as a function of within-host diversity (FWS) for the Plasmodium falciparum dataset.
FWS scales from 0 (maximum complexity) to 1 (minimum complexity). The blue line shows a linear regression (p < 2.2e-16, R² = 0.006, slope = 148.1). High within-host diversity does not appear to explain outliers in Locator’s prediction error.
Figure 7. Top – Predicted locations for 162 individuals from the HGDP panel, using 773 training samples and a 10Mbp window size.
The geographic centroid of per-window predictions for each individual is shown in black points, and lines connect predicted to true locations. Sample localities are colored by the mean test error with size scaled to the number of training samples. Bottom – Uncertainty from predictions in 10Mbp windows. Contours show the 95%, 50%, and 10% quantiles of a two-dimensional kernel density across windows.
Figure 7—figure supplement 1. Outliers in windowed Locator analyses identify genomic regions enriched for admixed ancestry.
(A) Windowed Locator predictions for Maya sample HGDP00871. (B) PCAs of all HGDP samples run on SNPs extracted from windows with predicted locations in western Europe (left) and west Africa (right). In these windows, sample HGDP00871 (open points) clusters in PC space with individuals from the region predicted by Locator, rather than with other genomes from the Americas.
Figure 8. Per-window test error and mean recombination rate for human populations in the HGDP dataset.
The top 2% of windows by test error were excluded from this analysis. The slope of the least-squares linear fit is −99.9723 km/(cM/Mbp), with adjusted R² = 0.2704.
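The two statistics quoted here, a least-squares slope and an adjusted R², can be computed as in the sketch below. It fits a line with one predictor and applies the standard adjustment 1 − (1 − R²)(n − 1)/(n − 2). The function name and the data are made up for illustration and do not reproduce the values in the caption.

```python
def linear_fit(x, y):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (slope, intercept, r_squared, adjusted_r_squared).
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least three paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1 - ss_res / ss_tot
    # One predictor, so the residual degrees of freedom are n - 2.
    adjusted = 1 - (1 - r_squared) * (n - 1) / (n - 2)
    return slope, intercept, r_squared, adjusted


# Made-up data: x could stand for recombination rate, y for per-window error.
slope, intercept, r2, adj_r2 = linear_fit([0.0, 1.0, 2.0, 3.0], [1.0, 0.0, 2.0, 1.0])
```

A negative slope in the caption's units would mean that windows with higher recombination rates tend to have lower test error.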
Figure 8—figure supplement 1. Mean test error for HGDP samples in 10-megabase windows.
Triangles show approximate centromere locations.
Figure 8—figure supplement 2. Mean test error for HGDP samples in 10-centimorgan windows.
Triangles show approximate centromere locations.
Figure 8—figure supplement 3. Distributions of centroid prediction error across samples.
Despite differences in error among genomic windows (Figure 8—figure supplements 1 and 2), error in the mean genome-wide predicted location is very similar when using megabase (top) or centimorgan (bottom) windows.
Figure 8—figure supplement 4. Distributions of prediction error across windows when using megabase- versus centimorgan-based windows.
Figure 9. Predicted locations for HGDP samples from central Asia using a model trained on SNPs within 100 kb of EDAR.
Black points show sampling locations. Arrows are colored by genotype at variant rs3827760 and point towards the predicted location. Frequency of the A allele by longitude is shown below the map.

References

    1. Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS, Davis A, Dean J, Devin M, Ghemawat S, Goodfellow I, Harp A, Irving G, Isard M, Jia Y, Jozefowicz R, Kaiser L, Kudlur M, Levenberg J, Mané D, Monga R, Moore S, Murray D, Olah C, Schuster M, Shlens J, Steiner B, Sutskever I, Talwar K, Tucker P, Vanhoucke V, Vasudevan V, Viégas F, Vinyals O, Warden P, Wattenberg M, Wicke M, Yu Y, Zheng X. TensorFlow: large-scale machine learning on heterogeneous systems. arXiv. 2015 https://arxiv.org/abs/1603.04467
    2. Adrion JR, Galloway JG, Kern AD. Predicting the landscape of recombination using deep learning. Molecular Biology and Evolution. 2020;37:1790–1808. doi: 10.1093/molbev/msaa038.
    3. Anopheles gambiae 1000 Genomes Consortium. Genetic diversity of the African malaria vector Anopheles gambiae. Nature. 2017;552:96–100. doi: 10.1038/nature24995.
    4. Auburn S, Campino S, Miotto O, Djimde AA, Zongo I, Manske M, Maslen G, Mangano V, Alcock D, MacInnis B, Rockett KA, Clark TG, Doumbo OK, Ouédraogo JB, Kwiatkowski DP. Characterization of within-host Plasmodium falciparum diversity using next-generation sequence data. PLOS ONE. 2012;7:e32891. doi: 10.1371/journal.pone.0032891.
    5. Baran Y, Quintela I, Carracedo A, Pasaniuc B, Halperin E. Enhanced localization of genetic samples through linkage-disequilibrium correction. The American Journal of Human Genetics. 2013;92:882–894. doi: 10.1016/j.ajhg.2013.04.023.