Mol Biol Evol. 2024 May 3;41(5):msae077.
doi: 10.1093/molbev/msae077.

Computationally Efficient Demographic History Inference from Allele Frequencies with Supervised Machine Learning

Linh N Tran et al. Mol Biol Evol.

Abstract

Inferring the past demographic history of natural populations from genomic data is of central concern in many studies across research fields. Previously, our group developed dadi, a widely used demographic history inference method based on the allele frequency spectrum (AFS) and maximum composite-likelihood optimization. However, dadi's optimization procedure can be computationally expensive. Here, we present donni (demography optimization via neural network inference), a new inference method based on dadi that is more efficient while maintaining comparable inference accuracy. For each dadi-supported demographic model, donni simulates the expected AFS for a range of model parameters, then trains a set of Mean Variance Estimation (MVE) neural networks using the simulated AFS. Trained networks can then be used to instantaneously infer the model parameters from future genomic data summarized by an AFS. We demonstrate that for many demographic models, donni can infer some parameters, such as population size changes, very well and other parameters, such as migration rates and times of demographic events, fairly well. Importantly, donni provides both parameter and confidence interval estimates from an input AFS with accuracy comparable to parameters inferred by dadi's likelihood optimization while bypassing its long and computationally intensive evaluation process. donni's performance demonstrates that supervised machine learning algorithms may be a promising avenue for developing more sustainable and computationally efficient demographic history inference methods.
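As a rough illustration of what a Mean Variance Estimation output provides, the sketch below assumes that a network emits a predicted mean and a predicted variance for each parameter, that it is scored with a Gaussian negative log-likelihood, and that a confidence interval is read off the matching normal distribution. The function names, the toy values, and the normal-interval construction are assumptions made for illustration; they are not taken from donni's code.

```python
import math
from statistics import NormalDist


def gaussian_nll(true_value, pred_mean, pred_var):
    """Negative log-likelihood of true_value under N(pred_mean, pred_var)."""
    if pred_var <= 0:
        raise ValueError("predicted variance must be positive")
    return 0.5 * (math.log(2 * math.pi * pred_var)
                  + (true_value - pred_mean) ** 2 / pred_var)


def normal_interval(pred_mean, pred_var, level=0.95):
    """Central interval of N(pred_mean, pred_var) covering `level` probability."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * math.sqrt(pred_var)
    return pred_mean - half_width, pred_mean + half_width


# Hypothetical network output for one parameter (illustrative values only).
low, high = normal_interval(pred_mean=1.2, pred_var=0.04, level=0.95)
```

Because the loss penalizes both a poor mean and a poorly sized variance, a model trained this way can report its own uncertainty alongside each point estimate.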

Keywords: demographic history inference; machine learning; population genomics.


Figures

Fig. 1.
Schematic of the workflow for training and testing donni. For a given demographic model a), we drew sets of model parameters b) from a biologically relevant range (supplementary table S1, Supplementary Material online). Each parameter set represents a demographic history and corresponds to an expected AFS. These parameters were input into simulator programs c) to generate training and test AFS d). We used the expected AFS simulated with dadi and their corresponding parameters as training data for donni’s MVE networks e). We generated test data either by Poisson sampling from dadi-simulated AFS or by varying recombination rates with msprime, resulting in a change in test data variance compared to training AFS. The output of donni’s trained networks includes both inferred parameters and their confidence intervals f).
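The Poisson-sampling step mentioned in this caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the expected AFS comes from the standard constant-size neutral result (expected count θ/i for sites with i derived copies) as a stand-in for a dadi-simulated spectrum, and all function names and parameter values here are hypothetical.

```python
import math
import random


def neutral_expected_afs(n_haplotypes, theta):
    """Expected counts of sites with i derived copies, i = 1..n-1 (theta / i)."""
    return [theta / i for i in range(1, n_haplotypes)]


def poisson_draw(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method.

    Suitable only for modest lam, since exp(-lam) underflows for large values.
    """
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def poisson_sample_afs(expected_afs, rng):
    """Add sampling noise by drawing each entry independently from a Poisson."""
    return [poisson_draw(lam, rng) for lam in expected_afs]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
expected = neutral_expected_afs(n_haplotypes=20, theta=100.0)  # toy values
noisy = poisson_sample_afs(expected, rng)
```

Each entry of the noisy spectrum is an integer count scattered around its expected value, which is the kind of variance the test data carry and the noise-free training spectra do not.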
Fig. 2.
Inference accuracy and computing time of donni and dadi for a two-population model. a) The two-population split-migration model with four parameters: ν1 and ν2 are the sizes of each population relative to the ancestral population, T is the time of the split, and m is the migration rate. b-i) Inference accuracy by donni b-e) and dadi f-i) for the four parameters on 100 test AFS (sample size of 20 haplotypes). j) Distribution of optimization times among test datasets for dadi. k) Computing time required for generating donni’s trained networks for two sample sizes. "Generate data" includes the computing time for generating 5,000 dadi-simulated AFS as training data. "Tuning & training" is the total computing time for hyperparameter tuning and training the MVE network using the simulated data.
Fig. 3.
Uninformative AFS affecting inference accuracy and uncertainty quantification method validation. a) The one-population two-epoch model with two parameters, ν for size change and T for time of size change. b-c) Inference accuracy for ν and T by donni on 100 test AFS, colored by simulated Tν values. d) Confidence interval coverage for ν and T by donni. The observed coverage is the percentage of test AFS that have the simulated parameter values captured within the corresponding expected interval. e-f) As an example, we show details of the 95% confidence interval data points from panel d for 100 test AFS. The simulated values for ν e) and T f) of these AFS are colored by their Tν values, similar to panels b-c. donni’s inferred parameter values and 95% confidence interval outputs are in brown. The percentage of simulated color dots lying within donni’s inferred brown interval gives the observed coverage at 95%. The light shades are the simulated parameter range (supplementary table S2, Supplementary Material online) used in simulating training and test AFS. The 100 test AFS are sorted along the x axis using true Tν values.
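The observed coverage defined in this caption is a simple proportion, and a minimal sketch of it is given below. The function name and the toy numbers are hypothetical and are not results from the paper.

```python
def observed_coverage(true_values, intervals):
    """Percentage of intervals (low, high) that contain the matching true value."""
    if len(true_values) != len(intervals):
        raise ValueError("need one interval per true value")
    if not true_values:
        raise ValueError("need at least one test case")
    hits = sum(low <= value <= high
               for value, (low, high) in zip(true_values, intervals))
    return 100.0 * hits / len(true_values)


# Toy values, not results from the paper.
true_T = [0.5, 1.0, 1.5, 2.0]
intervals_95 = [(0.3, 0.8), (0.9, 1.4), (1.6, 2.1), (1.2, 2.5)]
coverage = observed_coverage(true_T, intervals_95)  # 3 of 4 covered -> 75.0
```

For a well-calibrated method, the observed coverage at each nominal level (for example 95%) should be close to that level.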
Fig. 4.
donni’s inference accuracy and uncertainty quantification coverage on msprime-simulated test AFS with linkage. Each row shows the confidence interval coverage and inference accuracy for select parameters of the split-migration demographic model (Fig. 2a) at a given recombination rate. Recombination rate decreases from the top row to the bottom row, corresponding to increased linkage and data variance in the msprime-simulated test AFS. The same networks (trained on dadi-simulated AFS) were used in this analysis as in Fig. 2b-e.
Fig. 5.
Inference accuracy compared with dadi and confidence interval coverage by donni for the OOA demographic model. a) The three-population OOA model with 14 demographic history parameters. b-e) Inference accuracy for representative parameters on 30 simulated test AFS inferred by donni. g-j) Inference accuracy for the same parameters and 30 test AFS inferred by dadi. Each of the 30 test AFS is represented by a different color dot. For the accuracy of the rest of the parameters see supplementary fig. S8 and table S1, Supplementary Material online. f) donni confidence interval coverage for all model parameters.
