Genetics. 2023 May 26;224(2):iyad068.
doi: 10.1093/genetics/iyad068.

Dispersal inference from population genetic variation using a convolutional neural network

Chris C R Smith et al. Genetics. 2023.

Abstract

The geographic nature of biological dispersal shapes patterns of genetic variation over landscapes, making it possible to infer properties of dispersal from genetic variation data. Here, we present an inference tool that uses geographically distributed genotype data in combination with a convolutional neural network to estimate a critical population parameter: the mean per-generation dispersal distance. Using extensive simulation, we show that our deep learning approach is competitive with or outperforms state-of-the-art methods, particularly at small sample sizes. In addition, we evaluate varying nuisance parameters during training (including population density, demographic history, habitat size, and sampling area) and show that this strategy is effective for estimating dispersal distance when other model parameters are unknown. Whereas competing methods depend on information about local population density or accurate inference of identity-by-descent tracts, our method uses only single-nucleotide-polymorphism data and the spatial scale of sampling as input. Strikingly, and unlike other methods, our method does not use the geographic coordinates of the genotyped individuals. These features make our method, which we call "disperseNN," a potentially valuable new tool for estimating dispersal distance in nonmodel systems with whole-genome data or reduced-representation data. We apply disperseNN to 12 different species with publicly available data, yielding reasonable estimates for most species. Importantly, our method estimated consistently larger dispersal distances than mark-recapture calculations in the same species, which may be due to the limited geographic sampling area covered by some mark-recapture studies. Thus, genetic tools like ours complement direct methods for improving our understanding of dispersal.
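As a rough illustration of the training strategy described in the abstract (nuisance parameters drawn across wide ranges so the network need not know them), the sketch below draws per-simulation parameters; the parameter names, ranges, and the simulator mentioned in the comments are illustrative assumptions, not the values or tools used in the paper.

```python
# Hedged sketch of simulation-based training with varying nuisance parameters.
# Each training example would come from a spatial simulation whose nuisance
# parameters (density, habitat width, sampling-window width) are drawn from
# broad ranges; only the genotypes and the sampling width are given to the
# network, and the true dispersal distance is the regression target.
# All ranges below are hypothetical.
import numpy as np

rng = np.random.default_rng(1)

def draw_simulation_parameters():
    return {
        "dispersal_sigma": rng.uniform(0.2, 3.0),   # target: mean per-generation dispersal
        "density": rng.uniform(4.0, 25.0),          # nuisance: individuals per unit area
        "habitat_width": rng.uniform(50.0, 100.0),  # nuisance: width of simulated habitat
        "sampling_width": rng.uniform(10.0, 50.0),  # nuisance: width of sampling window
    }

# In practice each parameter draw would seed one run of a spatially explicit
# forward simulator, and its genotypes would become one training example.
for params in (draw_simulation_parameters() for _ in range(3)):
    print(params)
```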

Keywords: deep learning; dispersal; machine learning; population genomics; space.


Figures

Fig. 1.
Diagram of the analysis workflow. Points are hypothetical sample locations (n=10) on a geographic map. Rectangular tensors are the outputs of 1D-convolution and average-pooling layers, and columnar tensors are the outputs of fully connected layers. The number and dimensions of the tensors vary with the input dimensions; this example shows a single haplotype per individual, m=58 SNPs long. The box over the genotypes shows the size of the convolution kernel for the first layer. The two input branches are eventually concatenated into a single intermediate tensor. Neural network schematic generated using PlotNeuralNet (https://github.com/HarisIqbal88/PlotNeuralNet).
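The layout in the figure (a genotype branch of 1D convolutions with average pooling, a small dense branch for the width of the sampling area, and a concatenation followed by fully connected layers) could be expressed in Keras roughly as below. Filter counts, kernel sizes, and layer depths are placeholders, not the published configuration.

```python
# Rough Keras sketch of the two-branch network in the figure.  Genotypes enter
# as a (SNPs, individuals) tensor processed by 1D convolutions and average
# pooling along the SNP axis; the width of the sampling area enters through a
# small dense branch; the two branches are concatenated and passed through
# fully connected layers to output one number, the dispersal estimate.
import tensorflow as tf

m_snps, n_individuals = 58, 10   # matches the toy example in the figure

geno_in = tf.keras.Input(shape=(m_snps, n_individuals), name="genotypes")
x = tf.keras.layers.Conv1D(64, kernel_size=2, activation="relu")(geno_in)
x = tf.keras.layers.AveragePooling1D(pool_size=2)(x)
x = tf.keras.layers.Conv1D(64, kernel_size=2, activation="relu")(x)
x = tf.keras.layers.AveragePooling1D(pool_size=2)(x)
x = tf.keras.layers.Flatten()(x)

width_in = tf.keras.Input(shape=(1,), name="sampling_width")
w = tf.keras.layers.Dense(16, activation="relu")(width_in)

z = tf.keras.layers.Concatenate()([x, w])
z = tf.keras.layers.Dense(128, activation="relu")(z)
dispersal = tf.keras.layers.Dense(1, name="dispersal_estimate")(z)

model = tf.keras.Model(inputs=[geno_in, width_in], outputs=dispersal)
model.compile(optimizer="adam", loss="mse")
model.summary()
```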
Fig. 2.
Cartoon showing different sampling strategies. The large box represents the full simulated habitat. For some experiments, we both (i) varied the width of the square sampling window (boxes of different sizes) and (ii) assigned a uniform-random position to the sampling window (boxes at different positions).
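A minimal sketch of this sampling scheme, with a hypothetical habitat width and window-width range:

```python
# Minimal sketch of the sampling strategy in the caption: choose a square
# sampling window with a random width, placed uniformly at random inside the
# full habitat.  The habitat width and the width range are illustrative.
import numpy as np

rng = np.random.default_rng(0)
habitat_width = 100.0                              # full simulated habitat (the large box)
window_width = rng.uniform(10.0, habitat_width)    # (i) vary the window width
# (ii) uniform-random position such that the window stays inside the habitat
x0 = rng.uniform(0.0, habitat_width - window_width)
y0 = rng.uniform(0.0, habitat_width - window_width)
print(f"sampling window: x in [{x0:.1f}, {x0 + window_width:.1f}], "
      f"y in [{y0:.1f}, {y0 + window_width:.1f}]")
```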
Fig. 3.
Comparison with existing methods (Parameter Set 1). Here, disperseNN is compared with Rousset’s method and the method of Ringbauer et al. (2017), using both true identity-by-descent tracts (“IBD-Analysis”) and tracts inferred by Refined IBD (“Refined IBD + IBD-Analysis”). Two sample sizes are shown: n=10 (top row) and n=100 (bottom row). The numbers of SNPs used with each sample size were 2.5×10⁵ and 5×10⁵, respectively. The dashed lines are y=x. Estimates greater than 5 are excluded from the plots but are included in the MRAE calculation. Methods other than disperseNN sometimes produced undefined output; these data do not contribute to the MRAE. With n=10, the MRAE for IBD-Analysis exceeds that of Refined IBD + IBD-Analysis because of outlier points that inflated the MRAE under the former method but produced undefined output under the latter.
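For reference, the MRAE reported in these comparisons is the mean of |estimate − truth| / truth across test datasets. A toy calculation with made-up numbers, which also drops undefined estimates as the caption describes, might look like this:

```python
# Sketch of the error metric used in the figure: mean relative absolute error
# (MRAE) over test datasets, |estimate - truth| / truth averaged across
# datasets.  Undefined estimates (NaN) are dropped, mirroring the caption's
# note that they do not contribute to the MRAE.  The numbers are made up.
import numpy as np

true_sigma = np.array([0.5, 1.0, 2.0, 4.0])
estimated  = np.array([0.6, 0.9, np.nan, 7.0])   # one undefined, one large outlier

ok = ~np.isnan(estimated)
mrae = np.mean(np.abs(estimated[ok] - true_sigma[ok]) / true_sigma[ok])
print(f"MRAE = {mrae:.2f}")   # outliers excluded from the plots still count here
```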
Fig. 4.
Column 1: cartoons of unknown parameters that may lead to model misspecification. Column 2: the unknown parameter was fixed during training, but testing was performed on data with different values of the parameter. Column 3: the unknown parameter was varied during training, and testing was performed on data from the same distribution. Column 4: the unknown parameter was varied during training, but testing was performed on out-of-sample values, i.e., values larger than those seen during training. The dashed lines are y=x. Outliers greater than 3 are excluded from the fixed-habitat-size plot. “Train: P” and “Pred: P” refer to the Parameter Sets used for training and testing, respectively. MRAE is the mean relative absolute error. All analyses used samples of n=100 individuals. (*The third row has a separate baseline MRAE, 0.09, because it used a smaller carrying capacity, chosen to reduce computation time.)
Fig. 5.
Validation of the pretrained model (Parameter Set 11). Shown are 100 test datasets, each generated from an independent simulation. Open points indicate the mean estimate from 1,000 subsamples of 5,000 SNPs drawn from each dataset, with the sample size varying uniformly between 10 and 100 for each subsample. Also depicted is the range of estimates from the middle 95% of subsamples. The dashed line is y=x. Note the log scale. MRAE is the mean relative absolute error.
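The subsampling scheme in this caption could be sketched as follows; `predict_dispersal` is a hypothetical stand-in for running the pretrained network on one subsample, and the dataset sizes are made up.

```python
# Hedged sketch of the subsampling scheme in the caption: for one test
# dataset, repeatedly draw a subsample of individuals (size uniform on
# [10, 100]) and 5,000 SNPs, get a prediction for each subsample, then report
# the mean and the middle 95% of those predictions.
import numpy as np

rng = np.random.default_rng(2)
n_total_individuals, n_total_snps = 200, 100_000
n_subsamples, snps_per_subsample = 1000, 5000

def predict_dispersal(individual_idx, snp_idx):
    # placeholder returning a noisy value; the real analysis would run the CNN
    return 1.0 + 0.1 * rng.standard_normal()

estimates = []
for _ in range(n_subsamples):
    n = rng.integers(10, 101)                                       # sample size
    inds = rng.choice(n_total_individuals, size=n, replace=False)   # individuals
    snps = rng.choice(n_total_snps, size=snps_per_subsample, replace=False)
    estimates.append(predict_dispersal(inds, snps))

estimates = np.array(estimates)
print("mean estimate:", estimates.mean())
print("middle 95%:", np.quantile(estimates, [0.025, 0.975]))
```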

References

    1. The 1001 Genomes Consortium. 1,135 genomes reveal the global pattern of polymorphism in Arabidopsis thaliana. Cell. 2016;166(2):481–491.
    2. Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS, Davis A, Dean J, Devin M. TensorFlow: large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467, 2016, preprint: not peer reviewed.
    3. Abbott RJ, Gomes MF. Population genetic structure and outcrossing rate of Arabidopsis thaliana (L.) Heynh. Heredity. 1989;62(3):411–418. doi:10.1038/hdy.1989.56
    4. Adrion JR, Galloway JG, Kern AD. Predicting the landscape of recombination using deep learning. Mol Biol Evol. 2020;37(6):1790–1808. doi:10.1093/molbev/msaa038
    5. Akçakaya HR, Brook BW. Methods for determining viability of wildlife populations in large landscapes. Models for Planning Wildlife Conservation in Large Landscapes. 2008. p. 449–472.
