Mol Biol Evol. 2024 Aug 2;41(8):msae156.

doi: 10.1093/molbev/msae156.

Fast and Accurate Estimation of Selection Coefficients and Allele Histories from Ancient and Modern DNA

Andrew H Vaughn¹, Rasmus Nielsen²,³

Affiliations

¹ Center for Computational Biology, University of California, Berkeley, CA 94720, USA.
² Departments of Integrative Biology and Statistics, University of California, Berkeley, CA 94720, USA.
³ Center for GeoGenetics, University of Copenhagen, Copenhagen DK-1350, Denmark.

PMID: 39078618
PMCID: PMC11321360
DOI: 10.1093/molbev/msae156

Andrew H Vaughn et al. Mol Biol Evol. 2024.

Abstract

We here present CLUES2, a full-likelihood method to infer natural selection from sequence data that is an extension of the method CLUES. We make several substantial improvements to the CLUES method that greatly increase both its applicability and its speed. We add the ability to use ancestral recombination graphs on ancient data as emissions to the underlying hidden Markov model, which enables CLUES2 to use both temporal and linkage information to make estimates of selection coefficients. We also fully implement the ability to estimate distinct selection coefficients in different epochs, which allows for the analysis of changes in selective pressures through time, as well as selection with dominance. In addition, we greatly increase the computational efficiency of CLUES2 over CLUES using several approximations to the forward-backward algorithms and develop a new way to reconstruct historic allele frequencies by integrating over the uncertainty in the estimation of the selection coefficients. We illustrate the accuracy of CLUES2 through extensive simulations and validate the importance sampling framework for integrating over the uncertainty in the inference of gene trees. We also show that CLUES2 is well-calibrated by showing that under the null hypothesis, the distribution of log-likelihood ratios follows a χ² distribution with the appropriate degrees of freedom. We run CLUES2 on a set of recently published ancient human data from Western Eurasia and test for evidence of changing selection coefficients through time. We find significant evidence of changing selective pressures in several genes correlated with the introduction of agriculture to Europe and the ensuing dietary and demographic shifts of that time. In particular, our analysis supports previous hypotheses of strong selection on lactase persistence during periods of ancient famines and attenuated selection in more modern periods.
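As a rough illustration of the calibration check mentioned in the abstract, a log-likelihood ratio can be converted to a p-value under a χ² null distribution. The sketch below is not taken from CLUES2: the function name `lrt_pvalue` and its arguments are hypothetical, and only the closed-form cases of one and two degrees of freedom are handled.

```python
import math


def lrt_pvalue(loglik_null, loglik_alt, df):
    """P-value of a likelihood-ratio test under a chi-squared null.

    loglik_null, loglik_alt: maximized log-likelihoods of the nested
    null model and the alternative model (hypothetical inputs).
    df: number of extra free parameters in the alternative (1 or 2 here).
    """
    # Test statistic: twice the log-likelihood ratio, clipped at zero
    # to guard against small negative values from numerical error.
    stat = max(2.0 * (loglik_alt - loglik_null), 0.0)
    if df == 1:
        # Survival function of chi-squared with 1 degree of freedom.
        return math.erfc(math.sqrt(stat / 2.0))
    if df == 2:
        # Survival function of chi-squared with 2 degrees of freedom.
        return math.exp(-stat / 2.0)
    raise ValueError("this sketch only supports df = 1 or df = 2")


# Example: one selection coefficient tested against s = 0 (df = 1).
p_one_epoch = lrt_pvalue(loglik_null=-1050.0, loglik_alt=-1045.0, df=1)
```

Under the null hypothesis, p-values computed this way should be approximately uniform, which is one way to read the calibration claim above.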

Keywords: ARGs; HMMs; ancient DNA; lactase persistence; selection.

Figures

**Fig. 1.**
Violin plots showing the results of running CLUES2 on ancient genotype data. Boxplots are overlaid with the whiskers omitted. True values of s are shown as dashed lines. Thirty replicates were performed for each true value of s. Simulations were run with a) $N = 50,000$ and two individuals sampled every generation and with b) $N = 600,000$ and ten individuals sampled every generation.

**Fig. 2.**
An outline of the labeling of branches and nodes in a tree as either ancestral (blue) or derived (orange) given a labeling of the leaf nodes. An internal node is a derived coalescence if and only if all its descendant leaves are derived. The parent node of the oldest derived coalescence is the mixed coalescence node (represented in black). All other coalescence events are ancestral coalescences. A branch represents a derived lineage if and only if it has a derived node as an ancestor. The mixed lineage is the immediate parent branch of the oldest derived coalescence (black dashed line). All other branches represent ancestral lineages.
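The labeling rules in this caption can be sketched in a few lines. This is an illustrative reading of the caption, not code from CLUES2: the function name `label_tree`, the dictionary-based tree representation, and the choice to identify each branch by its lower (child) node are all assumptions. It further assumes the derived leaves form a single clade, and it treats a lone derived leaf as the top of that clade, a case the caption does not address.

```python
def label_tree(children, root, derived_leaves):
    """Label coalescence nodes and branches as derived, mixed, or ancestral.

    children: dict mapping each internal node to a list of its children.
    root: the root node of the tree.
    derived_leaves: set of leaves carrying the derived allele.
    Returns (node_label, branch_label); branches are keyed by child node.
    """
    all_derived = {}

    def visit(node):
        kids = children.get(node, [])
        if kids:
            # Visit every child first so that all nodes are recorded.
            flags = [visit(k) for k in kids]
            all_derived[node] = all(flags)
        else:
            all_derived[node] = node in derived_leaves
        return all_derived[node]

    visit(root)
    parent = {k: p for p, kids in children.items() for k in kids}

    # Top of the derived clade: fully derived node whose parent is not.
    tops = [n for n in parent if all_derived[n] and not all_derived[parent[n]]]
    if len(tops) != 1:
        raise ValueError("derived leaves must form exactly one proper clade")
    top = tops[0]

    node_label = {}
    for node, kids in children.items():
        if not kids:
            continue  # leaves are not coalescence events
        if all_derived[node]:
            node_label[node] = "derived"
        elif node == parent[top]:
            node_label[node] = "mixed"
        else:
            node_label[node] = "ancestral"

    branch_label = {}
    for child, par in parent.items():
        if child == top:
            branch_label[child] = "mixed"
        elif all_derived[par]:
            branch_label[child] = "derived"
        else:
            branch_label[child] = "ancestral"
    return node_label, branch_label


# Small example: leaves a and b are derived; c and d are ancestral.
tree = {"R": ["A", "c"], "A": ["B", "d"], "B": ["a", "b"]}
nodes, branches = label_tree(tree, "R", {"a", "b"})
```

In this example, `B` is the oldest derived coalescence, `A` is the mixed coalescence node, and the branch above `B` is the mixed lineage.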

**Fig. 3.**
Violin plots showing the results of running CLUES2 on true trees. Boxplots are overlaid with the whiskers omitted. True values of s are shown as dashed lines. Thirty replicates were performed for each true value of s. Simulations were run with a) $N = 30,000$ and 120 sampled leaves and b) $N = 600,000$ and 800 sampled leaves. A modern allele frequency of 0.75 was used for each simulation.

**Fig. 4.**
Violin plots showing the results of running CLUES2 on inferred topologies. Boxplots are overlaid with the whiskers omitted. True values of s are shown as dashed lines. Thirty replicates were performed for each true value of s. Simulations were run with a) $\mu = 3 \times 10^{-6}$ and one sample taken without importance sampling, b) $\mu = 4 \times 10^{-9}$ and one sample taken without importance sampling, and c) $\mu = 4 \times 10^{-9}$ and 600 samples taken and used in the importance sampling framework. A modern allele frequency of 0.75, a population size of $N = 40,000$, and 24 leaves were used for each simulation.

**Fig. 5.**
Violin plots showing the results of running CLUES2 on a combination of modern and ancient data. Boxplots are overlaid with the whiskers omitted. True values of s are shown as dashed lines. Fifty replicates were performed, with each replicate generating an estimate for each of the three selection coefficients. We run simulations a) where the ancient data is incorporated into the tree and b) where the ancient data is treated only as genotype emissions. A population size of $N = 40,000$ and 20 modern leaves are used, and 80 ancient leaves or 40 ancient genotypes are sampled at each of the times 50 and 100 generations before the present.

**Fig. 6.**
Violin plots showing the results of running CLUES2 on ancient genotype data simulated with differing selection coefficients through time. Boxplots are overlaid with the whiskers omitted. True values of s are shown as dashed lines. Thirty replicates were performed, with each replicate generating an estimate for each of the three selection coefficients. A population size of $N = 70,000$ was used and eight diploid individuals were sampled in each generation.

**Fig. 7.**
Illustration of Approximations A1, A2, and B. Approximation A1 approximates the transition matrix by a sparse banded matrix. Approximation A2 reduces the number of states in the previous column of F that are summed over to compute each entry of the forward matrix F. Approximation B reduces the number of entries that are computed in a column of the forward matrix F based on the probability density of the previous column of F. Here, the gray entries represent values that are computed, while the colored entries and arrows represent transition or forward probabilities. Lighter colors denote higher probabilities, and darker colors denote smaller probabilities.
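The idea behind a banded transition matrix (in the spirit of Approximation A1) can be sketched with a generic HMM forward recursion that only sums over states within a fixed distance of the target state. This is a simplified illustration under stated assumptions, not the CLUES2 implementation: the function name `forward_banded`, the `band` half-width parameter, and the list-of-lists inputs are all hypothetical, and Approximations A2 and B are not shown.

```python
def forward_banded(init, trans, emis, band):
    """Forward-algorithm likelihood using only a band of the transition matrix.

    init: list of K prior state probabilities.
    trans: K x K list of lists, trans[i][j] = P(next state j | state i).
    emis: list of T lists; emis[t][j] = likelihood of observation t in state j.
    band: half-width; transitions with |i - j| > band are treated as zero.
    """
    n_states = len(init)
    # First column of the forward matrix.
    col = [init[j] * emis[0][j] for j in range(n_states)]
    for t in range(1, len(emis)):
        new_col = [0.0] * n_states
        for j in range(n_states):
            # Only sum over source states inside the band around j.
            lo = max(0, j - band)
            hi = min(n_states, j + band + 1)
            total = sum(col[i] * trans[i][j] for i in range(lo, hi))
            new_col[j] = total * emis[t][j]
        col = new_col
    # Likelihood of the whole observation sequence.
    return sum(col)


# Toy example with a tridiagonal transition matrix, so band = 1 is exact.
toy_trans = [
    [0.90, 0.10, 0.00],
    [0.05, 0.90, 0.05],
    [0.00, 0.10, 0.90],
]
toy_init = [1.0 / 3.0] * 3
toy_emis = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
lik_banded = forward_banded(toy_init, toy_trans, toy_emis, band=1)
lik_full = forward_banded(toy_init, toy_trans, toy_emis, band=2)
```

When the true transition probabilities outside the band are negligible, the banded sum costs on the order of K × band operations per step instead of K², with little loss in accuracy.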

**Fig. 8.**
Comparative runtime of CLUES2 on different numbers of importance samples both with and without the stated approximations.

**Fig. 9.**
Comparison of the estimates of the log-likelihood function of our dataset of 100 importance samples both with and without our HMM approximations. The log-likelihood was evaluated at 60 values of s spaced equally between 0.014 and 0.04 for each case, and the plots of the functions were generated via linear interpolation.

**Fig. 10.**
Comparison of the ability of CLUES2 and several summary statistic-based methods to infer selection coefficients. We infer selection coefficients from the summary statistics using the ABC algorithm described above. The dataset is identical to that analyzed in Fig. 4c. The true simulated values of the selection coefficients are shown as horizontal dashed lines.
