Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
[Preprint]. 2024 Mar 27:2024.03.25.586640.
doi: 10.1101/2024.03.25.586640.

Accelerated Bayesian inference of population size history from recombining sequence data

Affiliations

Accelerated Bayesian inference of population size history from recombining sequence data

Jonathan Terhorst. bioRxiv. .

Abstract

I present PHLASH, a new Bayesian method for inferring population history from whole genome sequence data. PHLASH is population history learning by averaging sampled histories: it works by drawing random, low-dimensional projections of the coalescent intensity function from the posterior distribution of a PSMC-like model, and averaging them together to form an accurate and adaptive size history estimator. On simulated data, PHLASH tends to be faster and have lower error than several competing methods including SMC++, MSMC2, and FITCOAL. Moreover, it provides a full posterior distribution over population size history, leading to automatic uncertainty quantification of the point estimates, as well to new Bayesian testing procedures for detecting population structure and ancient bottlenecks. On the technical side, the key advance is a novel algorithm for computing the score function (gradient of the log-likelihood) of a coalescent hidden Markov model: when there are M hidden states, the algorithm requires 𝒪M2 time and 𝒪1 memory per decoded position, the same cost as evaluating the log-likelihood itself using the naïve forward algorithm. This algorithm is combined with a hand-tuned implementation that fully leverages the power of modern GPU hardware, and the entire method has been released as an easy-to-use Python software package.

PubMed Disclaimer

Figures

Figure D.1:
Figure D.1:
Simulation results for models 1–3. For n=1000, shaded regions indicate 95% credible intervals. For all other n, the solid (dotted) lines represent the median (min/max) across all simulation replicates.
Figure D.2:
Figure D.2:
Simulation results for models 4–6. See Figure D.1 for additional information.
Figure D.3:
Figure D.3:
Simulation results for models 7–9. See Figure D.1 for additional information.
Figure D.4:
Figure D.4:
Simulation results for models 10–12. See Figure D.1 for additional information.
Figure D.5:
Figure D.5:
Ne estimates for archaic humans. Solid lines are posterior medians, and blue shaded area shows the 95% credible intervals for the filtered data. “Unfiltered” refers to the raw genotypes extracted from the Wohns et al. (2022) dataset, while “filtered” are the estimates obtained after running the filtering procedure described in the main text.
Figure 1:
Figure 1:
Parallel algorithm for evaluating the likelihood of an HMM. The exact algorithm evaluates the likelihood sequentially across each observation (grey blocks in figure). The parallel algorithm breaks the observation sequence up into a set of overlapping blocks which can be evaluated in parallel. Blue blocks are used to “warm up” the filtering distribution, and red blocks contribute to the likelihood calculation.
Figure 2:
Figure 2:
Example output of the algorithm for the “sawtooth” demography (Schiffels and Durbin, 2014; Adrion et al., 2020). n=1000 diploid samples were simulated according to the size history shown in black. The posterior distribution returned by PHLASH is summarized by its median (dark blue line), central 95% point-wise credible interval (shaded blue region), and 95% simultaneous credible band (dashed green line).
Figure 3:
Figure 3:
L2 versus total variation error for the African3Epoch_S16 demography when n = 1. phlash has large mean-squared error relative to msmc2 and smc++ (a), but it mainly occurs at very recent times, when there is low density of coalescence. Total variation error (b–c) accounts for this by measuring the L1 distance between the true and inferred density functions.
Figure 4:
Figure 4:
Benchmarking results. (a) Time needed to evaluate score function as a function of batch size. The sequence length was L=105. Custom algorithm is compared forward- and reverse-mode automatic differentiation of the accelerated/linear-time forward algorithm using Jax (Bradbury et al., 2018). For n=1000, reverse-mode automatic differentiation exhausted the available GPU memory and was unable to be evaluated. (b) Mean CPU time and (c) peak memory usage for the various methods, with standard error bars.
Figure 5:
Figure 5:
Accuracy of likelihood approximations. Panel (a) is a box-whisker plot of the relative error, defined as loglikexactloglikparallel/loglikexact, for various settings of the overlap parameter f, across all simulations. Panels (b)–(e) compare the results of model fitting using the composite likelihood approximate, versus independent likelihoods (see text for description).
Figure 6:
Figure 6:
Posterior distribution of the recombination rate for each of the models shown in Table D.1.
Figure 7:
Figure 7:
Results of running phlash on 3,609 genomes. Panel (a): estimates for all populations. Vertical shaded region is 813–930kya, see Section 4.4. (b) Estimates after grouping by super-population. (c-d): Estimates and 95% credible intervals for three (c) modern and (d) archaic subpopulations.
Figure 8:
Figure 8:
Posterior inference of population structure. (a) Posterior distribution (median + 95% credible interval) of cross-coalescent rate between the Yoruba (YRI) and Han (CHB) populations. (b) Posterior distribution of non-parametric ratio estimator. (c) Three-population out-of-Africa model used to generate simulated data.
Figure 9:
Figure 9:
Analyzing ancient DNA using phlash. a) Histogram of the different mutation types in the ancient samples. Enrichment for CT mutations are likely the result of deamination, as are TC mutations after accounting for uncertainty in the ancestral allele. GA mutations are likely the result of deaminated bases on the complementary strand. b) Size history estimates for the Vindija Neanderthal before and after filtering out probable deaminated bases.
Figure 10:
Figure 10:
Searching for signs of an ancient bottleneck in real and simulated human data. (a) Histogram of posterior median of inft>tancNet for each of the 159 populations considered. (b-c) The estimated size history of the Yoruba/Han population (black line, estimated from real data) was modified to incorporate a bottleneck. Orange, green, and blue correspond to bottleneck strength α100,101,102 respectively. Grey shaded region is 813×103,930×103 years before present, assuming a generation time of 24 years. Dashed lines represent αNet, the population size during the bottleneck, and solid lines represent the corresponding phlash estimates. (d) Mirror plot of posterior density of test statistic for Han and Yoruba. Shaded density curves are its distribution in simulated data; bar plots are a histogram of its distribution in real data.

Similar articles

References

    1. Adrion Jeffrey R et al. (2020). “A community-maintained standard library of population genetic models”. In: elife 9, e54967. - PMC - PubMed
    1. Arredondo Armando et al. (2021). “Inferring number of populations and changes in connectivity under the n-island model”. In: Heredity 126.6, pp. 896–912. - PMC - PubMed
    1. Al-Asadi Hussein et al. (2019). “Estimating recent migration and population-size surfaces”. In: PLoS genetics 15.1, e1007908. - PMC - PubMed
    1. Baharian Soheil and Gravel Simon (2018). “On the decidability of population size histories from finite allele frequency spectra”. In: Theoretical Population Biology 120, pp. 42–51. - PubMed
    1. Baumdicker Franz et al. (2022). “Efficient ancestry and mutation simulation with msprime 1.0”. In: Genetics 220.3, iyab229. - PMC - PubMed

Publication types