Mol Biol Evol. 2024 May 3;41(5):msae077.

doi: 10.1093/molbev/msae077.

Computationally Efficient Demographic History Inference from Allele Frequencies with Supervised Machine Learning

Linh N Tran¹,², Connie K Sun², Travis J Struck², Mathews Sajan², Ryan N Gutenkunst²

Affiliations

¹ Genetics Graduate Interdisciplinary Program, University of Arizona, Tucson, AZ 85721, USA.
² Department of Molecular & Cellular Biology, University of Arizona, Tucson, AZ 85721, USA.

PMID: 38636507
PMCID: PMC11082913
DOI: 10.1093/molbev/msae077

Linh N Tran et al. Mol Biol Evol. 2024.


Abstract

Inferring the past demographic history of natural populations from genomic data is a central concern in many studies across research fields. Our group previously developed dadi, a widely used demographic history inference method based on the allele frequency spectrum (AFS) and maximum composite-likelihood optimization. However, dadi's optimization procedure can be computationally expensive. Here, we present donni (demography optimization via neural network inference), a new inference method based on dadi that is more efficient while maintaining comparable inference accuracy. For each dadi-supported demographic model, donni simulates the expected AFS for a range of model parameters, then trains a set of Mean Variance Estimation (MVE) neural networks using the simulated AFS. Trained networks can then be used to instantaneously infer the model parameters from future genomic data summarized by an AFS. We demonstrate that for many demographic models, donni can infer some parameters, such as population size changes, very well and other parameters, such as migration rates and times of demographic events, fairly well. Importantly, donni provides both parameter and confidence interval estimates from an input AFS, with accuracy comparable to that of parameters inferred by dadi's likelihood optimization, while bypassing dadi's long and computationally intensive evaluation process. donni's performance demonstrates that supervised machine learning algorithms may be a promising avenue for developing more sustainable and computationally efficient demographic history inference methods.
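As a rough illustration of the Mean Variance Estimation idea, the sketch below shows the Gaussian negative log-likelihood that such networks are commonly trained to minimize, and how a predicted mean and variance imply a confidence interval. The abstract does not specify donni's loss or interval construction, so the function names, the Gaussian interval, and the example network output are all assumptions made for illustration.

```python
import math
from statistics import NormalDist


def gaussian_nll(y_true, mean, log_var):
    """Negative log-likelihood of y_true under N(mean, exp(log_var)).

    Predicting the log-variance keeps the variance positive.
    """
    var = math.exp(log_var)
    return 0.5 * (math.log(2 * math.pi) + log_var + (y_true - mean) ** 2 / var)


def confidence_interval(mean, log_var, level=0.95):
    """Central interval at the given level implied by a Gaussian prediction."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * math.sqrt(math.exp(log_var))
    return mean - half_width, mean + half_width


# Hypothetical network output for one parameter (not taken from the paper):
# predicted mean 0.8 with predicted variance 0.04 (standard deviation 0.2).
mean, log_var = 0.8, math.log(0.04)
low, high = confidence_interval(mean, log_var)
print(f"95% interval: ({low:.3f}, {high:.3f})")
```

A network with two outputs per parameter (mean and log-variance) trained on this loss yields both a point estimate and an uncertainty estimate in a single forward pass, which is the property the abstract relies on.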

Keywords: demographic history inference; machine learning; population genomics.

Figures

**Fig. 1.**
Schematic of the workflow for training and testing donni. For a given demographic model a), we drew sets of model parameters b) from a biologically relevant range (supplementary table S1, Supplementary Material online). Each parameter set represents a demographic history and corresponds to an expected AFS. These parameters were input into simulator programs c) to generate training and test AFS d). We use the expected AFS simulated with dadi and their corresponding parameters as training data for donni’s MVE networks e). We generated test data either by Poisson sampling from dadi-simulated AFS or by varying recombination rates with msprime, resulting in a change in test data variance compared to training AFS. The output of donni’s trained networks includes both inferred parameters and their confidence intervals f).
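The Poisson-sampling step used to generate test data can be sketched as follows. This is a minimal standard-library illustration, not the authors' code: the expected AFS here is a toy constant-size neutral spectrum (entry i proportional to θ/i) rather than a dadi simulation, and `poisson_draw`, `sample_afs`, and the chosen θ are hypothetical names and values.

```python
import math
import random


def poisson_draw(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method.

    Adequate for modest lam; exp(-lam) underflows for very large lam.
    """
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_afs(expected_afs, rng):
    """Add Poisson sampling noise independently to each AFS entry."""
    return [poisson_draw(lam, rng) for lam in expected_afs]


# Toy expected AFS for 20 haplotypes: frequency classes 1..19, theta = 100.
n = 20
theta = 100.0
expected = [theta / i for i in range(1, n)]

rng = random.Random(42)  # fixed seed so the draw is reproducible
noisy = sample_afs(expected, rng)
print(noisy)
```

Training on noise-free expected spectra while testing on sampled spectra, as the caption describes, checks whether the networks remain accurate when the data carry sampling variance.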

**Fig. 2.**
Inference accuracy and computing time of donni and dadi for a two-population model. a) The two-population split-migration model with four parameters: $ν_{1}$ and $ν_{2}$ are the sizes of each population relative to the ancestral population, T is the time of the split, and m is the migration rate. b-i) Inference accuracy by donni b-e) and dadi f-i) for the four parameters on 100 test AFS (sample size of 20 haplotypes). j) Distribution of optimization times among test datasets for dadi. k) Computing time required for generating donni’s trained networks for two sample sizes. “Generate data” is the computing time for generating 5,000 dadi-simulated AFS as training data. “Tuning & training” is the total computing time for hyperparameter tuning and training the MVE network using the simulated data.

**Fig. 3.**
Uninformative AFS affecting inference accuracy and uncertainty quantification method validation. a) The one-population two-epoch model with two parameters, ν for size change and T for time of size change. b-c) Inference accuracy for ν and T by donni on 100 test AFS, colored by simulated $T/ν$ values. d) Confidence interval coverage for ν and T by donni. The observed coverage is the percentage of test AFS that have the simulated parameter values captured within the corresponding expected interval. e-f) As an example, we show details of the 95% confidence interval data points from panel d for 100 test AFS. The simulated values for ν e) and T f) of these AFS are colored by their $T/ν$ values, similar to panels b-c. donni’s inferred parameter values and 95% confidence interval outputs are in brown. The percentage of simulated color dots lying within donni’s inferred brown interval gives the observed coverage at 95%. The light shades are the simulated parameter range (supplementary table S2, Supplementary Material online) used in simulating training and test AFS. The 100 test AFS are sorted along the x axis using true $T/ν$ values.
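The observed coverage defined in this caption is a simple proportion, sketched below. The function name and the four example intervals are made up for illustration and do not come from the paper.

```python
def observed_coverage(true_values, intervals):
    """Percentage of test cases whose true value lies within its interval."""
    hits = sum(low <= t <= high for t, (low, high) in zip(true_values, intervals))
    return 100.0 * hits / len(true_values)


# Four hypothetical test AFS: simulated (true) values and inferred intervals.
true_values = [1.0, 2.0, 3.0, 4.0]
intervals = [(0.5, 1.5), (2.1, 2.9), (2.0, 3.5), (3.0, 5.0)]

coverage = observed_coverage(true_values, intervals)
print(coverage)  # 3 of the 4 intervals contain the true value: 75.0
```

For a well-calibrated method, the observed coverage at each nominal level (for example 95%) should be close to that level across the test set.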

**Fig. 4.**
donni’s inference accuracy and uncertainty quantification coverage on msprime-simulated test AFS with linkage. Each row shows the confidence interval coverage and inference accuracy for select parameters of the split-migration demographic model (Fig. 2a) at a different recombination rate. Recombination rate decreases from the top to the bottom row, corresponding to increased linkage and data variance in the msprime-simulated test AFS. The same networks (trained on dadi-simulated AFS) were used in this analysis as in Fig. 2b-e.

**Fig. 5.**
Inference accuracy compared with dadi and confidence interval coverage by donni for the OOA demographic model. a) The three-population OOA model with 14 demographic history parameters. b-e) Inference accuracy for representative parameters on 30 simulated test AFS inferred by donni. g-j) Inference accuracy for the same parameters and 30 test AFS inferred by dadi. Each of the 30 test AFS is represented by a different color dot. For the accuracy of the rest of the parameters see supplementary fig. S8 and table S1, Supplementary Material online. f) donni confidence interval coverage for all model parameters.

Update of

Computationally efficient demographic history inference from allele frequencies with supervised machine learning.
Tran LN, Sun CK, Struck TJ, Sajan M, Gutenkunst RN. bioRxiv [Preprint]. 2024 Feb 15:2023.05.24.542158. doi: 10.1101/2023.05.24.542158. Update in: Mol Biol Evol. 2024 May 3;41(5):msae077. doi: 10.1093/molbev/msae077. PMID: 38405827.
