[Preprint]. 2025 Feb 15:2025.02.14.638385.
doi: 10.1101/2025.02.14.638385.

A General Framework for Branch Length Estimation in Ancestral Recombination Graphs

Yun Deng et al. bioRxiv. 2025.

Abstract

Inference of Ancestral Recombination Graphs (ARGs) is of central interest in the analysis of genomic variation. ARGs can be specified in terms of topologies and coalescence times. The coalescence times are usually estimated using an informative prior derived from coalescent theory, but this may generate biased estimates and can also complicate downstream inferences based on ARGs. Here we introduce POLEGON, a novel approach for estimating branch lengths in ARGs that uses an uninformative prior. Using extensive simulations, we show that this method provides improved estimates of coalescence times and leads to more accurate inferences of effective population sizes under a wide range of demographic assumptions. It also improves other downstream inferences, including estimates of mutation rates. We apply the method to data from the 1000 Genomes Project to investigate population size histories and differential mutation signatures across populations. We also estimate coalescence times in the HLA region and show that they exceed 30 million years in multiple segments.

Figures

Figure 1:
Methodology overview of POLEGON. The tree sequence format of local genealogies on each non-recombining block (A) can be converted to a DAG (B), by merging the nodes with the same index across trees. The width of the edges indicates their spatial spans in the tree sequence. When updating the age of a particular node (C), its age can only move within the interval (D) determined by Eq. (2), as the blue branches must have positive branch length. The age of the node is perturbed in this interval with an MCMC algorithm to sample from Eq. (3). We denote the number of mutations mapped to the branch connecting node u and v by k(u,v), and the total length of the genomic segments where node u and v are adjacent by s(u,v), where u and v are unordered.
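The node-age update described in the Figure 1 caption can be illustrated with a minimal Python sketch. Eq. (2) and Eq. (3) are not reproduced on this page, so the details here are assumptions: the feasible interval is taken to be bounded by the oldest child and the youngest parent of the node, the target is taken to be a Poisson mutation likelihood on the adjacent branches under a flat prior, and the names `age_interval`, `log_target`, `update_age`, and the mutation rate `mu` are hypothetical. It is not POLEGON's implementation.

```python
import math
import random


def age_interval(node, age, parents, children):
    """Interval in which `node` can move while every adjacent branch stays positive."""
    lower = max((age[c] for c in children.get(node, ())), default=0.0)
    upper = min((age[p] for p in parents.get(node, ())), default=math.inf)
    return lower, upper


def log_target(node, t, age, parents, children, k, s, mu):
    """Assumed target: Poisson log-likelihood of mutations on branches adjacent
    to `node` if its age were `t`, with a flat prior on the age."""
    neighbours = [(p, age[p] - t) for p in parents.get(node, ())]
    neighbours += [(c, t - age[c]) for c in children.get(node, ())]
    total = 0.0
    for other, length in neighbours:
        if length <= 0.0:
            return -math.inf  # non-positive branch length is not allowed
        key = frozenset((node, other))  # u and v are unordered
        rate = mu * s[key] * length  # expected mutations on this branch
        total += k[key] * math.log(rate) - rate
    return total


def update_age(node, age, parents, children, k, s, mu, rng=random):
    """One Metropolis step: propose uniformly in the interval, accept by target ratio."""
    lower, upper = age_interval(node, age, parents, children)
    if math.isinf(upper):
        return age[node]  # nodes without parents are left untouched in this sketch
    proposal = rng.uniform(lower, upper)
    log_ratio = (log_target(node, proposal, age, parents, children, k, s, mu)
                 - log_target(node, age[node], age, parents, children, k, s, mu))
    if rng.random() < math.exp(min(0.0, log_ratio)):
        age[node] = proposal
    return age[node]
```

A uniform proposal over the feasible interval does not depend on the current age, so the acceptance ratio reduces to the ratio of target densities. Keying `k` and `s` by `frozenset` pairs mirrors the caption's convention that u and v are unordered.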
Figure 2:
Calibration of distribution with ARG rescaling. (A) ARG rescaling moves the original mis-specified posterior distribution closer to the correct distribution, even if the true coalescent prior is unknown; (B) The rank plot of the node ages against the node age samples from POLEGON, before (red) and after (blue) rescaling, in simulations under three different demographic models: constant size, CEU model and YRI model; (C) The pairwise TMRCA distribution in simulation (black) compared with that inferred by POLEGON, with (blue) and without (red) the ARG rescaling operation.
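The rank plot in panel (B) rests on a standard calibration check: if posterior samples are well calibrated, the rank of the true value among the samples is uniformly distributed across nodes. A minimal Python sketch of that rank statistic is given here; the function names and the binning are illustrative assumptions, not the authors' code.

```python
def rank_of_truth(true_value, samples):
    """Number of posterior samples that fall strictly below the true value."""
    return sum(1 for x in samples if x < true_value)


def rank_histogram(true_values, sample_sets, n_bins):
    """Bin the normalised ranks; a roughly flat histogram suggests calibration."""
    counts = [0] * n_bins
    for truth, samples in zip(true_values, sample_sets):
        frac = rank_of_truth(truth, samples) / len(samples)  # in [0, 1]
        bin_index = min(int(frac * n_bins), n_bins - 1)  # clamp frac == 1
        counts[bin_index] += 1
    return counts
```

Here each entry of `sample_sets` would hold the sampled ages of one node and the matching entry of `true_values` its simulated age. Ranks piling up at either end of the histogram would indicate samples that are systematically too old or too young.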
Figure 3:
Comparison of inference accuracy among Relate (green), SINGER (red) and SINGER+POLEGON (blue) in four respects: pairwise TMRCA (A), pairwise TMRCA distribution (B), local diversity (C), and local mutation density (D).
Figure 4:
Demography inference results. Simulation benchmarks of demography inference under the CEU model (A) and the synthetic model (B); each method's inference error, measured by coalescence rate divergence, is given in brackets. We also inferred the population size history for CEU (C) and YRI (D) in the 1000 Genomes Project.
Figure 5:
Mutation rate trajectory inference and ancient coalescence times at HLA inferred with SINGER+POLEGON. (A) Performance of mutation rate trajectory inference when assuming uniform placement (red) and using the proposed iterative algorithm (blue); (B) Inferred mutation rate trajectory for TCC → TTC in CEU and YRI, with the proposed iterative algorithm; (C) The estimated average pairwise TMRCA in the HLA region using SINGER (red) compared to SINGER+POLEGON (blue); (D) The number of observed variants versus the number predicted from ARGs inferred with SINGER, in 10 kb windows; (E) The number of observed variants versus the number predicted from ARGs inferred with SINGER+POLEGON, in 10 kb windows.
