Mol Ecol Resour. 2021 Nov;21(8):2598-2613.
doi: 10.1111/1755-0998.13413. Epub 2021 May 21.

Extending approximate Bayesian computation with supervised machine learning to infer demographic history from genetic polymorphisms using DIYABC Random Forest


François-David Collin et al. Mol Ecol Resour. 2021 Nov.

Abstract

Simulation-based methods such as approximate Bayesian computation (ABC) are well-adapted to the analysis of complex scenarios of population and species genetic history. In this context, supervised machine learning (SML) methods provide attractive statistical solutions for efficient inferences about scenario choice and parameter estimation. Random Forest (RF) is a powerful SML ensemble method, aggregating many decision trees, used for classification or regression problems. RF allows inferences to be conducted at a low computational cost, without preliminary selection of the relevant components of the ABC summary statistics, and bypassing the derivation of ABC tolerance levels. We have implemented a set of RF algorithms to perform inferences on simulated data sets generated from an extended version of the population genetic simulator implemented in DIYABC v2.1.0. The resulting computer package, named DIYABC Random Forest v1.0, integrates two functionalities into a user-friendly interface: the simulation, under custom evolutionary scenarios, of different types of molecular data (microsatellites, DNA sequences or SNPs), and RF treatments including statistical tools to evaluate the power and accuracy of inferences. We illustrate the functionalities of DIYABC Random Forest v1.0 for both scenario choice and parameter estimation through the analysis of pseudo-observed and real data sets corresponding to pool-sequencing and individual-sequencing SNP data. Because of the properties inherent to the implemented RF methods and the large feature vector (including various summary statistics and their linear combinations) available for SNP data, DIYABC Random Forest v1.0 can efficiently contribute to the analysis of large SNP data sets and to inferences about complex population genetic histories.
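To make the workflow concrete, here is a minimal sketch of RF-based scenario choice on a simulated reference table. It uses scikit-learn's RandomForestClassifier rather than the forest implementation inside DIYABC Random Forest, and simulate_summary_stats is a hypothetical stand-in for the package's coalescent simulator; all numbers are arbitrary.

    # Minimal ABC-RF sketch. Hypothetical: DIYABC Random Forest uses its own
    # simulation engine and RF implementation, not scikit-learn.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)

    def simulate_summary_stats(scenario, n_sims, n_stats=130):
        # Stand-in for the coalescent simulator: draw parameters from their
        # priors, simulate data under `scenario`, return summary statistics.
        # Here the statistics are simply faked with a scenario-dependent shift.
        return rng.normal(loc=scenario, scale=1.0, size=(n_sims, n_stats))

    # Reference table: simulations under each of six competing scenarios.
    n_per_scenario = 2000
    X = np.vstack([simulate_summary_stats(s, n_per_scenario) for s in range(6)])
    y = np.repeat(np.arange(6), n_per_scenario)

    # Train the classification forest on the reference table. No tolerance
    # level and no preliminary selection of summary statistics is required.
    rf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
    rf.fit(X, y)

    # Classify the observed data set from its summary statistics.
    obs = simulate_summary_stats(scenario=3, n_sims=1)  # pseudo-observed
    print("Chosen scenario:", rf.predict(obs)[0])
    print("Per-scenario support:", rf.predict_proba(obs)[0])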

Keywords: SNP; approximate Bayesian computation; demographic history; model or scenario selection; parameter estimation; pool-sequencing; population genetics; random forest; supervised machine learning.


Figures

FIGURE 1
Evolutionary scenarios compared. The target population (pop 4) has three possible single (i.e., nonadmixed) population sources (pop 1, pop 2 or pop 3), composing a group of three scenarios without admixture (group 2 in the figure), and three possible admixed pairwise population sources (i.e., admixture between pop 1 & pop 2, pop 1 & pop 3, and pop 2 & pop 3), composing a group of three scenarios with admixture (group 1 in the figure). Demographic and historical parameters include four effective population sizes N1, N2, N3 and N4 (for populations 1, 2, 3 and 4, respectively) and three divergence or admixture time events (t1, t2 and t3). For the scenarios with admixture, the parameter ra corresponds to the proportion of genes of a given source population entering into the admixed population 4. See text for details about the prior distributions of parameters.
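For readers following along in code, the scenario structure described in this caption can be summarized as below; this encoding is purely illustrative and is not the header syntax used by DIYABC Random Forest.

    # Six competing scenarios of Figure 1, encoded by the source
    # population(s) of the target population (pop 4). Illustrative only.
    GROUP_1_WITH_ADMIXTURE = [("pop 1", "pop 2"), ("pop 1", "pop 3"),
                              ("pop 2", "pop 3")]
    GROUP_2_NO_ADMIXTURE = [("pop 1",), ("pop 2",), ("pop 3",)]

    # Parameters shared across scenarios: effective sizes, event times,
    # and (admixed scenarios only) the admixture proportion ra.
    PARAMETERS = ["N1", "N2", "N3", "N4", "t1", "t2", "t3", "ra"]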
FIGURE 2
Projection of the PoolSeq data sets from the training set on a single LDA axis when analysing the two groups of scenarios (a), or on the first two LDA axes when analysing the six scenarios separately (b). The six compared scenarios and the two groups of scenarios are detailed in Figure 1. The location of the PoolSeq pseudo-observed data set in the LDA projection is indicated by a vertical line and a star symbol in panels a and b, respectively. The pseudo-observed data set was simulated under the (admixed) scenario 3 (belonging to group 1) using the following parameter values: N1 = 7,000, N2 = 2,000, N3 = 4,000, N4 = 3,000, t1 = 200, ra = 0.3, t2 = 300 and t3 = 500.
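A projection like the one in this figure can be sketched with scikit-learn's LinearDiscriminantAnalysis; the snippet below assumes the reference table X, labels y and pseudo-observed vector obs from the sketch after the abstract. Note that in DIYABC Random Forest the LDA axes are appended to the feature vector given to the forest rather than used on their own.

    # Illustrative LDA projection (assumes X, y and obs from the earlier
    # sketch). With six scenarios, LDA yields up to five axes.
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    lda = LinearDiscriminantAnalysis()
    X_lda = lda.fit_transform(X, y)   # training set on the LDA axes
    obs_lda = lda.transform(obs)      # locate the pseudo-observed data set

    # First two axes, as plotted in panel (b).
    print("Training set shape on LD axes:", X_lda.shape)
    print("Pseudo-observed data set on LD1/LD2:", obs_lda[0, :2])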
FIGURE 3
Contributions, for the PoolSeq data analyses, of the 30 most informative statistics to the random forest when choosing among scenarios considered separately (a) and when estimating the parameter t1/N4 under scenario 3 (b). The variable importance of each statistic is computed as the mean decrease of impurity across the trees, where the impurity measure is the Gini index for scenario choice and the residual sum of squares for parameter inference. For each variable, the impurity decrease is accumulated over every tree of the forest each time that variable is chosen to split a node; the sum is then divided by the number of trees to give an average. The scale is irrelevant: only the relative values matter. The variable importance was computed for each of the 130 summary statistics provided by DIYABC Random Forest, plus the LDA axes for scenario choice (denoted LD) or the PLS components for parameter estimation (denoted Comp.) that were added to the feature vector. The higher the variable importance, the more informative the statistic. Population index(es) are indicated at the end of each statistic and correspond to those in Figure 1. More details about summary statistics can be found in Table S1. See Figure S3 for an illustration of the contributions of the most informative statistics when choosing among the two groups of scenarios and when estimating the parameters ra, t1 and N4.
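In scikit-learn terms, this mean decrease of impurity is what a fitted forest's feature_importances_ attribute reports (there normalized to sum to one, which is consistent with only relative values mattering). A minimal sketch, reusing the forest rf from the example after the abstract:

    # Rank statistics by mean decrease of impurity, as in this figure.
    import numpy as np

    importances = rf.feature_importances_          # one value per statistic
    top30 = np.argsort(importances)[::-1][:30]     # 30 most informative
    for rank, idx in enumerate(top30, start=1):
        print(f"{rank:2d}. statistic #{idx:3d}  importance = {importances[idx]:.4f}")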
