This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2024 Mar 27:2024.03.25.586640.

doi: 10.1101/2024.03.25.586640.

Accelerated Bayesian inference of population size history from recombining sequence data

Jonathan Terhorst¹

Affiliations

PMID: 38585997
PMCID: PMC10996539
DOI: 10.1101/2024.03.25.586640

Accelerated Bayesian inference of population size history from recombining sequence data

Jonathan Terhorst. bioRxiv. 2024.

[Preprint]. 2024 Mar 27:2024.03.25.586640.

doi: 10.1101/2024.03.25.586640.

Author

Jonathan Terhorst¹

Affiliation

¹ Department of Statistics, University of Michigan, Ann Arbor, MI, USA.

PMID: 38585997
PMCID: PMC10996539
DOI: 10.1101/2024.03.25.586640

Abstract

I present PHLASH, a new Bayesian method for inferring population history from whole genome sequence data. PHLASH is population history learning by averaging sampled histories: it works by drawing random, low-dimensional projections of the coalescent intensity function from the posterior distribution of a PSMC-like model, and averaging them together to form an accurate and adaptive size history estimator. On simulated data, PHLASH tends to be faster and have lower error than several competing methods including SMC++, MSMC2, and FITCOAL. Moreover, it provides a full posterior distribution over population size history, leading to automatic uncertainty quantification of the point estimates, as well to new Bayesian testing procedures for detecting population structure and ancient bottlenecks. On the technical side, the key advance is a novel algorithm for computing the score function (gradient of the log-likelihood) of a coalescent hidden Markov model: when there are $M$ hidden states, the algorithm requires $𝒪 (M^{2})$ time and $𝒪 (1)$ memory per decoded position, the same cost as evaluating the log-likelihood itself using the naïve forward algorithm. This algorithm is combined with a hand-tuned implementation that fully leverages the power of modern GPU hardware, and the entire method has been released as an easy-to-use Python software package.

PubMed Disclaimer

Figures

**Figure D.1:**
Simulation results for models 1–3. For $n = 1000$ , shaded regions indicate 95% credible intervals. For all other $n$ , the solid (dotted) lines represent the median (min/max) across all simulation replicates.

**Figure D.2:**
Simulation results for models 4–6. See Figure D.1 for additional information.

**Figure D.3:**
Simulation results for models 7–9. See Figure D.1 for additional information.

**Figure D.4:**
Simulation results for models 10–12. See Figure D.1 for additional information.

**Figure D.5:**
$N_{e}$ estimates for archaic humans. Solid lines are posterior medians, and blue shaded area shows the 95% credible intervals for the filtered data. “Unfiltered” refers to the raw genotypes extracted from the Wohns et al. (2022) dataset, while “filtered” are the estimates obtained after running the filtering procedure described in the main text.

**Figure 1:**
Parallel algorithm for evaluating the likelihood of an HMM. The exact algorithm evaluates the likelihood sequentially across each observation (grey blocks in figure). The parallel algorithm breaks the observation sequence up into a set of overlapping blocks which can be evaluated in parallel. Blue blocks are used to “warm up” the filtering distribution, and red blocks contribute to the likelihood calculation.

**Figure 2:**
Example output of the algorithm for the “sawtooth” demography (Schiffels and Durbin, 2014; Adrion et al., 2020). $n = 1000$ diploid samples were simulated according to the size history shown in black. The posterior distribution returned by PHLASH is summarized by its median (dark blue line), central 95% point-wise credible interval (shaded blue region), and 95% simultaneous credible band (dashed green line).

**Figure 3:**
$L_{2}$ versus total variation error for the African3Epoch_S16 demography when n = 1. phlash has large mean-squared error relative to msmc2 and smc++ (a), but it mainly occurs at very recent times, when there is low density of coalescence. Total variation error (b–c) accounts for this by measuring the $L_{1}$ distance between the true and inferred density functions.

**Figure 4:**
Benchmarking results. (a) Time needed to evaluate score function as a function of batch size. The sequence length was $L = 10^{5}$ . Custom algorithm is compared forward- and reverse-mode automatic differentiation of the accelerated/linear-time forward algorithm using Jax (Bradbury et al., 2018). For $n = 1000$ , reverse-mode automatic differentiation exhausted the available GPU memory and was unable to be evaluated. (b) Mean CPU time and (c) peak memory usage for the various methods, with standard error bars.

**Figure 5:**
Accuracy of likelihood approximations. Panel (a) is a box-whisker plot of the relative error, defined as $|{loglik}_{exact} - {loglik}_{parallel}| / {loglik}_{exact}$ , for various settings of the overlap parameter $f$ , across all simulations. Panels (b)–(e) compare the results of model fitting using the composite likelihood approximate, versus independent likelihoods (see text for description).

**Figure 6:**
Posterior distribution of the recombination rate for each of the models shown in Table D.1.

**Figure 7:**
Results of running phlash on 3,609 genomes. Panel (a): estimates for all populations. Vertical shaded region is 813–930kya, see Section 4.4. (b) Estimates after grouping by super-population. (c-d): Estimates and 95% credible intervals for three (c) modern and (d) archaic subpopulations.

**Figure 8:**
Posterior inference of population structure. (a) Posterior distribution (median + 95% credible interval) of cross-coalescent rate between the Yoruba (YRI) and Han (CHB) populations. (b) Posterior distribution of non-parametric ratio estimator. (c) Three-population out-of-Africa model used to generate simulated data.

**Figure 9:**
Analyzing ancient DNA using phlash. a) Histogram of the different mutation types in the ancient samples. Enrichment for $C \to T$ mutations are likely the result of deamination, as are $T \to C$ mutations after accounting for uncertainty in the ancestral allele. $G \leftrightarrow A$ mutations are likely the result of deaminated bases on the complementary strand. b) Size history estimates for the Vindija Neanderthal before and after filtering out probable deaminated bases.

**Figure 10:**
Searching for signs of an ancient bottleneck in real and simulated human data. (a) Histogram of posterior median of $\inf_{t > t_{anc}} N_{e} (t)$ for each of the 159 populations considered. (b-c) The estimated size history of the Yoruba/Han population (black line, estimated from real data) was modified to incorporate a bottleneck. Orange, green, and blue correspond to bottleneck strength $α \in \{10^{0}, 10^{- 1}, 10^{- 2}\}$ respectively. Grey shaded region is $[813 \times 10^{3}, 930 \times 10^{3}]$ years before present, assuming a generation time of 24 years. Dashed lines represent $α N_{e} (t)$ , the population size during the bottleneck, and solid lines represent the corresponding phlash estimates. (d) Mirror plot of posterior density of test statistic for Han and Yoruba. Shaded density curves are its distribution in simulated data; bar plots are a histogram of its distribution in real data.

See this image and copyright information in PMC

References

1. Adrion Jeffrey R et al. (2020). “A community-maintained standard library of population genetic models”. In: elife 9, e54967. - PMC - PubMed
1. Arredondo Armando et al. (2021). “Inferring number of populations and changes in connectivity under the n-island model”. In: Heredity 126.6, pp. 896–912. - PMC - PubMed
1. Al-Asadi Hussein et al. (2019). “Estimating recent migration and population-size surfaces”. In: PLoS genetics 15.1, e1007908. - PMC - PubMed
1. Baharian Soheil and Gravel Simon (2018). “On the decidability of population size histories from finite allele frequency spectra”. In: Theoretical Population Biology 120, pp. 42–51. - PubMed
1. Baumdicker Franz et al. (2022). “Efficient ancestry and mutation simulation with msprime 1.0”. In: Genetics 220.3, iyab229. - PMC - PubMed

Publication types

Actions

Grants and funding

R35 GM151145/GM/NIGMS NIH HHS/United States

LinkOut - more resources

Full Text Sources

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

This is a preprint.

Accelerated Bayesian inference of population size history from recombining sequence data

Affiliation

Accelerated Bayesian inference of population size history from recombining sequence data

Author

Affiliation

Abstract

Figures

Similar articles

References

Publication types

Grants and funding

LinkOut - more resources

Full Text Sources

This is a preprint.

Abstract

Figures

Similar articles

References

Publication types

Related information

Grants and funding

LinkOut - more resources

Full Text Sources