Bioinformatics. 2023 Jun 30;39(39 Suppl 1):i185-i193.
doi: 10.1093/bioinformatics/btad221.

Phylogenomic branch length estimation using quartets


Yasamin Tabatabaee et al. Bioinformatics. 2023.

Abstract

Motivation: Branch lengths and topology of a species tree are essential in most downstream analyses, including estimation of diversification dates, characterization of selection, understanding adaptation, and comparative genomics. Modern phylogenomic analyses often use methods that account for the heterogeneity of evolutionary histories across the genome due to processes such as incomplete lineage sorting. However, these methods typically do not generate branch lengths in units that are usable by downstream applications, forcing phylogenomic analyses to resort to alternative shortcuts such as estimating branch lengths by concatenating gene alignments into a supermatrix. Yet, concatenation and other available approaches for estimating branch lengths fail to address heterogeneity across the genome.

Results: In this article, we derive expected values of gene tree branch lengths in substitution units under an extension of the multispecies coalescent (MSC) model that allows substitutions with varying rates across the species tree. We present CASTLES, a new technique for estimating branch lengths on the species tree from estimated gene trees that uses these expected values, and our study shows that CASTLES improves on the most accurate prior methods with respect to both speed and accuracy.

Availability and implementation: CASTLES is available at https://github.com/ytabatabaee/CASTLES.


Conflict of interest statement

None declared.

Figures

Figure 1.
(a) MSC + Substitution model. Each branch of the species tree is furnished with parameters described in the legend. As a gene tree evolves inside the species tree, its branches inherit the substitution rates of all the species tree branches that they pass through. When mutation rates change across species tree branches, the resulting gene tree is non-ultrametric. We match the theoretical expected values of the five branches of a gene tree that matches or does not match the species tree (namely, $L_A$, $L_B$, $L_C$, $L_D$, and $L_I$ for a matching gene tree shown here) to their empirical means, computed from gene trees. (b) Handling a tree with more than four taxa. Each focal internal branch (arrow) divides the tree into four groups, here denoted as A, B, C, and D. To use quartet-based equations, we average branch lengths over all quartets with one leaf selected from each of A, B, C, and D (e.g. $a_1, b_1, c_1, d_1$). Note that in one gene tree, some quartets around a species tree branch may contribute to matching while others contribute to non-matching average lengths (examples shown). We compute these averages efficiently without listing all $O(n^4)$ quartets using dynamic programming.
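The quartet averaging described in panel (b) can be illustrated with a naive sketch. It enumerates every quartet explicitly, so it is the brute-force baseline rather than the dynamic programming used by CASTLES, and it covers only the internal branch of each quartet. The function names, the nested-dictionary distance format `dist[x][y]`, and the use of the four-point condition to read off each quartet's topology and internal length are assumptions made for illustration; they are not taken from the paper or its code.

```python
from itertools import product


def quartet_internal_length(dist, a, b, c, d):
    """Return (topology, internal branch length) of the quartet induced on
    leaves a, b, c, d by an additive gene tree with leaf distances dist[x][y].

    In an additive tree, the pairing with the smallest distance sum is the
    quartet topology, and the other two sums exceed it by twice the internal
    branch length (four-point condition).
    """
    sums = {
        "ab|cd": dist[a][b] + dist[c][d],
        "ac|bd": dist[a][c] + dist[b][d],
        "ad|bc": dist[a][d] + dist[b][c],
    }
    topology = min(sums, key=sums.get)
    internal = (max(sums.values()) - sums[topology]) / 2
    return topology, internal


def average_internal_lengths(dist, A, B, C, D):
    """Average quartet internal lengths over all quartets with one leaf from
    each of the groups A, B, C, D around a focal species tree branch.

    Assumes the species tree places A and B on one side of the focal branch
    and C and D on the other, so a quartet "matches" when it resolves as ab|cd.
    Matching and non-matching quartets are averaged separately.
    """
    totals = {"match": 0.0, "nonmatch": 0.0}
    counts = {"match": 0, "nonmatch": 0}
    for a, b, c, d in product(A, B, C, D):
        topology, internal = quartet_internal_length(dist, a, b, c, d)
        key = "match" if topology == "ab|cd" else "nonmatch"
        totals[key] += internal
        counts[key] += 1
    # None signals that no quartet fell into that category.
    return {k: (totals[k] / counts[k] if counts[k] else None) for k in totals}
```

With groups of sizes $|A|, |B|, |C|, |D|$, this loop visits $|A| \cdot |B| \cdot |C| \cdot |D|$ quartets per branch and per gene tree, which is the cost the dynamic programming approach mentioned in the caption is meant to avoid.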
Figure 2.
Quartet datasets: mean absolute error (a) and bias (b) of branch lengths estimated using different methods. From left to right, the conditions include more rate variation or higher ILS, creating more challenges for branch length estimation. (a) Mean and standard error across replicates in addition to boxplots. The y-axis is cut at 0.25, eliminating 16 outlier cases with unusually high errors (none from CASTLES). (b) Mean and standard deviation.
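As a minimal sketch of the two metrics named in this caption, the functions below compute mean absolute error and bias for a list of estimated branch lengths against the true ones. The definitions are assumptions: bias is taken here as the mean signed difference (estimated minus true), and the function names are hypothetical. The paper may define or aggregate these quantities differently.

```python
def mean_absolute_error(estimated, true):
    """Mean of |estimated - true| over paired branch lengths."""
    if len(estimated) != len(true) or not true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(e - t) for e, t in zip(estimated, true)) / len(true)


def bias(estimated, true):
    """Mean signed error (estimated - true); assumed definition of bias.

    Positive values indicate overestimation on average, negative values
    indicate underestimation.
    """
    if len(estimated) != len(true) or not true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(e - t for e, t in zip(estimated, true)) / len(true)
```

The two can disagree: a method that overestimates some branches and underestimates others by the same amount has zero bias but nonzero mean absolute error, which is why the figure reports both.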
Figure 3.
101-taxon datasets: mean and standard error of mean absolute error (a) and mean and standard deviation of bias (b) of branch lengths estimated using different methods. The average gene tree estimation error (GTEE) ranges from 0% (for true gene trees) to 23% (for the 1600 bp sequences) and 55% (for the 200 bp sequences). The number of genes is 1000 and the results are shown across 50 replicates. The y-axis is cut at 0.045, eliminating ten outlier cases (none from CASTLES). (c) Running time (log scale), including distance matrix calculation and species tree annotation (by mean branch lengths) but not gene tree estimation; concatenation includes branch length estimation for fixed topology.
Figure 4.
30-taxon MVRoot dataset. (a) Mean absolute error of estimated branch lengths on the 30-taxon MVRoot dataset, with or without an outgroup and with different levels of deviation from a strict clock. The number of genes is 500 and the results are shown across 100 replicates; the y-axis is cut at 0.11, leaving 16 outliers out of the graph (one from CASTLES). (b) Focusing on cases without outgroups, we divide replicates based on their level of true gene tree discordance due to ILS into four groups. We show mean log error to control for the correlation between ILS and branch length. Patristic(MIN) + FastME has mean log error above 2 (see Supplementary Fig. S18) and is excluded.

