Mol Biol Evol. 2023 Oct 4;40(10):msad211.
doi: 10.1093/molbev/msad211.

Inference of Coalescence Times and Variant Ages Using Convolutional Neural Networks

Juba Nait Saada et al. Mol Biol Evol.

Abstract

Accurate inference of the time to the most recent common ancestor (TMRCA) between pairs of individuals and of the age of genomic variants is key in several population genetic analyses. We developed a likelihood-free approach, called CoalNN, which uses a convolutional neural network to predict pairwise TMRCAs and allele ages from sequencing or SNP array data. CoalNN is trained through simulation and can be adapted to varying parameters, such as demographic history, using transfer learning. Across several simulated scenarios, CoalNN matched or outperformed the accuracy of model-based approaches for pairwise TMRCA and allele age prediction. We applied CoalNN to settings for which model-based approaches are under-developed and performed analyses to gain insights into the set of features it uses to perform TMRCA prediction. We next used CoalNN to analyze 2,504 samples from 26 populations in the 1000 Genomes Project data set, inferring the age of ∼80 million variants. We observed substantial variation across populations and for variants predicted to be pathogenic, reflecting heterogeneous demographic histories and the action of negative selection. We used CoalNN's predicted allele ages to construct genome-wide annotations capturing the signature of past negative selection. We performed LD-score regression analysis of heritability using summary association statistics from 63 independent complex traits and diseases (average N = 314k), observing increased annotation-specific effects on heritability compared to a previous allele age annotation. These results highlight the effectiveness of using likelihood-free, simulation-trained models to infer properties of gene genealogies in large genomic data sets.

Keywords: allele age; coalescence time; heritability; machine learning; natural selection.

Figures

Fig. 1.
Overview of the CoalNN model. (a) CoalNN comprises a batch normalization layer followed by five convolution blocks (convolution layer + batch normalization + ReLU) and a final 1×1 convolution layer. The input sequence includes additional contextual data (denoted by ‘Context’ in the figure). The view offered here is simplified: in practice, a convolutional layer goes through all input channels and the outputs are summed to create one of the output channels. This process is repeated with a new convolutional layer for every output channel. (b) When making the output piecewise constant, CoalNN averages all inferred TMRCAs between consecutive genomic sites with an estimated probability of recombination that exceeds a user-specified threshold.
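The piecewise-constant step in panel (b) can be sketched as follows. This is a minimal illustration, not CoalNN's actual code: the function name `piecewise_constant`, its arguments, and the default threshold are hypothetical, and it assumes that a site whose estimated recombination probability exceeds the threshold opens a new segment.

```python
def piecewise_constant(tmrcas, recomb_probs, threshold=0.5):
    """Average per-site TMRCAs within segments delimited by likely recombination sites.

    Assumption (not from the paper): a site whose recombination probability
    exceeds `threshold` is treated as the first site of a new segment.
    """
    if len(tmrcas) != len(recomb_probs):
        raise ValueError("tmrcas and recomb_probs must have the same length")

    smoothed = []
    start = 0
    n = len(tmrcas)
    for i in range(1, n + 1):
        # Close the current segment at the end of the sequence, or just before
        # a site flagged as a likely recombination breakpoint.
        if i == n or recomb_probs[i] > threshold:
            segment = tmrcas[start:i]
            mean = sum(segment) / len(segment)
            smoothed.extend([mean] * len(segment))
            start = i
    return smoothed


# Toy input: one likely breakpoint at the fourth site.
example = piecewise_constant(
    [100, 200, 300, 1000, 1200],
    [0.0, 0.1, 0.2, 0.9, 0.1],
    threshold=0.5,
)
print(example)
```

Raising the threshold yields fewer, longer segments and therefore smoother output; lowering it keeps more of the per-site variation.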
Fig. 2.
Procedure for dating genomic variants. (a) At a given genomic site, individuals are connected through underlying genealogical relationships. We aim to infer the time t at which a mutation arose (denoted by the star) and resulted in carrier haplotypes and noncarrier haplotypes. (b) When dating variants, CoalNN first infers TMRCAs across all concordant (two carriers) and discordant (one carrier and one noncarrier) pairs of haplotypes and then rejects outlier pairs using the heuristic approach developed in Albers and McVean (2020). The TMRCA rejection threshold is computed by minimizing the total number of rejected pairs. The predicted age estimate for the variant is obtained by averaging the maximum coalescence time across concordant pairs tc and the minimum coalescence time across discordant pairs td after filtering.
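The dating step in panel (b) can be sketched in a few lines. This is one possible reading of the caption, not the published implementation: the function name `estimate_variant_age` is hypothetical, and the rejection rule (drop concordant pairs older than the threshold and discordant pairs younger than it, choosing the threshold that rejects the fewest pairs) is an assumption about how the Albers and McVean (2020) heuristic is applied.

```python
def estimate_variant_age(concordant, discordant):
    """Estimate a variant's age from pairwise TMRCAs.

    concordant: TMRCAs of carrier-carrier haplotype pairs.
    discordant: TMRCAs of carrier-noncarrier haplotype pairs.

    Assumed rejection rule (illustrative): a concordant pair is an outlier if
    its TMRCA is above the threshold, a discordant pair if it is below.
    """
    if not concordant or not discordant:
        raise ValueError("need at least one concordant and one discordant pair")

    def rejected(t):
        # Number of pairs inconsistent with a mutation arising at time t.
        return sum(c > t for c in concordant) + sum(d < t for d in discordant)

    # Candidate thresholds are the observed TMRCAs; keep the one that
    # rejects the fewest pairs (earliest candidate wins ties).
    candidates = sorted(set(concordant) | set(discordant))
    threshold = min(candidates, key=rejected)

    kept_concordant = [c for c in concordant if c <= threshold]
    kept_discordant = [d for d in discordant if d >= threshold]
    if not kept_concordant or not kept_discordant:
        raise ValueError("no pairs left on one side after filtering")

    t_c = max(kept_concordant)  # latest time by which carriers coalesce
    t_d = min(kept_discordant)  # earliest carrier/noncarrier coalescence
    return (t_c + t_d) / 2


# Toy input with one outlier on each side (900 and 50).
age = estimate_variant_age([100, 150, 200, 900], [300, 400, 500, 50])
print(age)
```

In this toy input the outliers 900 and 50 are rejected, so the estimate is the midpoint of the remaining bounds, 200 and 300.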
Fig. 3.
Pairwise TMRCA and allele age prediction on sequencing data. (a) True pairwise TMRCAs (x axis) versus those estimated by CoalNN and ASMC (y axis) under a European demographic model for one simulation. For TMRCA prediction performance of CoalNN and ASMC by decile of the true TMRCA distribution, see supplementary table 3, Supplementary Material online. (b) True nonsingleton variant ages (x axis) versus those estimated by CoalNN, Relate, and tsdate+tsinfer (y axis) under a constant diploid population size Ne=10,000.
Fig. 4.
Running time evaluation. Running time (in milliseconds) of CoalNN (on a single A100 GPU card and a single CPU, and on a single CPU only) and ASMC (on a single CPU, optimized and nonoptimized version) on array data using the first 30 Mbp of chromosome 2 across 6,749 SNPs. The batch size for both methods is 64.
Fig. 5.
Age distribution of dated variants among different population groups. (a) Cumulative age distribution function of all dated variants across the human genome per population group. For each line, only nonsingleton polymorphic variants present in that population within a given derived allele frequency bin were considered. (b) Differences in allele age distribution between pathogenic mutations (annotated as such by PolyPhen-2 and by SIFT) and neutral variants for a derived allele frequency between 1% and 2.5% within each population group.
Fig. 6.
S-LDSC analysis of CoalNN MAF-adjusted allele age annotations. (a) We report correlations computed on common SNPs (MAF ≥ 5%) between each of the 26 population-specific MAF-adjusted CoalNN annotations and evolutionary annotations from the baseline model. ARGweaver allele age, ASMCavg, and LLD-AFR annotations are also adjusted for MAF. Numerical results are reported in supplementary table 13, Supplementary Material online. (b) Effect size τ* estimates (meta-analyzed across 63 independent diseases and complex traits listed in supplementary table 2, Supplementary Material online) of CoalNN MAF-adjusted allele age annotation on all 26 populations and of ARGweaver MAF-adjusted allele age annotation (Rasmussen et al. 2014), in marginal S-LDSC analysis conditioned on 96 baseline annotations (the full baseline model except for ARGweaver) (Gazal et al. 2017). We also report effect sizes of baselineLD evolutionary annotations [level of LD measured in African populations LLD-AFR, recombination rate, nucleotide diversity, B-statistic (McVicker et al. 2009), CpG content (Zhang et al. 2021), and average pairwise TMRCA ASMCavg (Palamara and Terhorst 2018)] after the introduction of either the CoalNN or ARGweaver allele age annotation. Error bars represent standard errors of the meta-analyzed τ* estimates. See supplementary table 14, Supplementary Material online for numerical results.

References

    1. 1000 Genomes Project Consortium. 2015. A global reference for human genetic variation. Nature. 526(7571):68.
    2. Adrion JR, Galloway JG, Kern AD. 2020. Predicting the landscape of recombination using deep learning. Mol Biol Evol. 37(6):1790–1808.
    3. Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, Kondrashov AS, Sunyaev SR. 2010. A method and server for predicting damaging missense mutations. Nat Methods. 7(4):248–249.
    4. Albers PK, McVean G. 2020. Dating genomic variants and shared ancestry in population-scale sequencing data. PLoS Biol. 18(1):e3000586.
    5. Albrechtsen A, Moltke I, Nielsen R. 2010. Natural selection and the distribution of identity-by-descent in the human genome. Genetics. 186(1):295–308.