Mol Ecol Resour. 2021 Nov;21(8):2598-2613.
doi: 10.1111/1755-0998.13413. Epub 2021 May 21.

Extending approximate Bayesian computation with supervised machine learning to infer demographic history from genetic polymorphisms using DIYABC Random Forest


François-David Collin et al. Mol Ecol Resour. 2021 Nov.

Abstract

Simulation-based methods such as approximate Bayesian computation (ABC) are well-adapted to the analysis of complex scenarios of population and species genetic history. In this context, supervised machine learning (SML) methods provide attractive statistical solutions for efficient inferences about scenario choice and parameter estimation. Random Forest (RF) is a powerful SML ensemble method, aggregating many decision trees, used for classification or regression problems. RF allows inferences to be conducted at a low computational cost, without preliminary selection of the relevant components of the ABC summary statistics, and bypassing the derivation of ABC tolerance levels. We have implemented a set of RF algorithms to perform inferences on simulated data sets generated from an extended version of the population genetic simulator implemented in DIYABC v2.1.0. The resulting computer package, named DIYABC Random Forest v1.0, integrates two functionalities into a user-friendly interface: the simulation, under custom evolutionary scenarios, of different types of molecular data (microsatellites, DNA sequences or SNPs), and RF treatments including statistical tools to evaluate the power and accuracy of inferences. We illustrate the functionalities of DIYABC Random Forest v1.0 for both scenario choice and parameter estimation through the analysis of pseudo-observed and real data sets corresponding to pool-sequencing and individual-sequencing SNP data. Because of the properties inherent to the implemented RF methods and the large feature vector (including various summary statistics and their linear combinations) available for SNP data, DIYABC Random Forest v1.0 can efficiently contribute to the analysis of large SNP data sets and to inferences about complex population genetic histories.
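To make the workflow concrete, here is a minimal sketch of RF-based scenario choice on a simulated reference table. It uses scikit-learn's RandomForestClassifier rather than the forest implementation inside DIYABC Random Forest, and simulate_summary_stats is a hypothetical stand-in for the package's coalescent simulator; all numbers are arbitrary.

    # Minimal ABC-RF sketch. Hypothetical: DIYABC Random Forest uses its own
    # simulation engine and RF implementation, not scikit-learn.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)

    def simulate_summary_stats(scenario, n_sims, n_stats=130):
        # Stand-in for the coalescent simulator: draw parameters from their
        # priors, simulate data under `scenario`, return summary statistics.
        # Here the statistics are simply faked with a scenario-dependent shift.
        return rng.normal(loc=scenario, scale=1.0, size=(n_sims, n_stats))

    # Reference table: simulations under each of six competing scenarios.
    n_per_scenario = 2000
    X = np.vstack([simulate_summary_stats(s, n_per_scenario) for s in range(6)])
    y = np.repeat(np.arange(6), n_per_scenario)

    # Train the classification forest on the reference table. No tolerance
    # level and no preliminary selection of summary statistics is required.
    rf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
    rf.fit(X, y)

    # Classify the observed data set from its summary statistics.
    obs = simulate_summary_stats(scenario=3, n_sims=1)  # pseudo-observed
    print("Chosen scenario:", rf.predict(obs)[0])
    print("Per-scenario support:", rf.predict_proba(obs)[0])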

Keywords: SNP; approximate Bayesian computation; demographic history; model or scenario selection; parameter estimation; pool-sequencing; population genetics; random forest; supervised machine learning.


Figures

FIGURE 1
Evolutionary scenarios compared. The target population (pop 4) has three possible single (i.e., nonadmixed) population sources (pop 1, pop 2 or pop 3), composing a group of three scenarios without admixture (group 2 in the figure), and three possible admixed pairwise population sources (i.e., admixture between pop 1 & pop 2, pop 1 & pop 3, and pop 2 & pop 3), composing a group of three scenarios with admixture (group 1 in the figure). Demographic and historical parameters include four effective population sizes N1, N2, N3 and N4 (for populations 1, 2, 3 and 4, respectively) and three divergence or admixture time events (t1, t2 and t3). For the scenarios with admixture, the parameter ra corresponds to the proportion of genes of a given source population entering into the admixed population 4. See text for details about the prior distributions of parameters.
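For readers following along in code, the scenario structure described in this caption can be summarized as below; this encoding is purely illustrative and is not the header syntax used by DIYABC Random Forest.

    # Six competing scenarios of Figure 1, encoded by the source
    # population(s) of the target population (pop 4). Illustrative only.
    GROUP_1_WITH_ADMIXTURE = [("pop 1", "pop 2"), ("pop 1", "pop 3"),
                              ("pop 2", "pop 3")]
    GROUP_2_NO_ADMIXTURE = [("pop 1",), ("pop 2",), ("pop 3",)]

    # Parameters shared across scenarios: effective sizes, event times,
    # and (admixed scenarios only) the admixture proportion ra.
    PARAMETERS = ["N1", "N2", "N3", "N4", "t1", "t2", "t3", "ra"]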
FIGURE 2
Projection of the PoolSeq data sets from the training set on a single LDA axis when analysing the two groups of scenarios (a), or on the first two LDA axes when analysing the six scenarios separately (b). The six compared scenarios and the two groups of scenarios are detailed in Figure 1. The location of the PoolSeq pseudo-observed data set in the LDA projection is indicated by a vertical line and a star symbol in panels a and b, respectively. The pseudo-observed data set was simulated under the (admixed) scenario 3 (belonging to group 1) using the following parameter values: N1 = 7,000, N2 = 2,000, N3 = 4,000, N4 = 3,000, t1 = 200, ra = 0.3, t2 = 300 and t3 = 500.
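A projection like the one in this figure can be sketched with scikit-learn's LinearDiscriminantAnalysis; the snippet below assumes the reference table X, labels y and pseudo-observed vector obs from the sketch after the abstract. Note that in DIYABC Random Forest the LDA axes are appended to the feature vector given to the forest rather than used on their own.

    # Illustrative LDA projection (assumes X, y and obs from the earlier
    # sketch). With six scenarios, LDA yields up to five axes.
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    lda = LinearDiscriminantAnalysis()
    X_lda = lda.fit_transform(X, y)   # training set on the LDA axes
    obs_lda = lda.transform(obs)      # locate the pseudo-observed data set

    # First two axes, as plotted in panel (b).
    print("Training set shape on LD axes:", X_lda.shape)
    print("Pseudo-observed data set on LD1/LD2:", obs_lda[0, :2])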
FIGURE 3
Contributions, for the PoolSeq data analyses, of the 30 most informative statistics to the random forest when choosing among scenarios considered separately (a) and when estimating the parameter t1/N4 under scenario 3 (b). The variable importance of each statistic is computed as the mean decrease of impurity across the trees, where the impurity measure is the Gini index for scenario choice and the residual sum of squares for parameter inference. For each variable, the impurity decrease is accumulated over every tree of the forest each time that variable is chosen to split a node; the sum is then divided by the number of trees to give an average. The scale is irrelevant: only the relative values matter. The variable importance was computed for each of the 130 summary statistics provided by DIYABC Random Forest, plus the LDA axes for scenario choice (denoted LD) or the PLS components for parameter estimation (denoted Comp.) that were added to the feature vector. The higher the variable importance, the more informative the statistic. Population index(es) are indicated at the end of each statistic and correspond to those in Figure 1. More details about summary statistics can be found in Table S1. See Figure S3 for an illustration of the contributions of the most informative statistics when choosing among the two groups of scenarios and when estimating the parameters ra, t1 and N4.
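In scikit-learn terms, this mean decrease of impurity is what a fitted forest's feature_importances_ attribute reports (there normalized to sum to one, which is consistent with only relative values mattering). A minimal sketch, reusing the forest rf from the example after the abstract:

    # Rank statistics by mean decrease of impurity, as in this figure.
    import numpy as np

    importances = rf.feature_importances_          # one value per statistic
    top30 = np.argsort(importances)[::-1][:30]     # 30 most informative
    for rank, idx in enumerate(top30, start=1):
        print(f"{rank:2d}. statistic #{idx:3d}  importance = {importances[idx]:.4f}")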
