Review
Philos Trans R Soc Lond B Biol Sci. 2022 Oct 10;377(1861):20210244.
doi: 10.1098/rstb.2021.0244. Epub 2022 Aug 22.

Recent progress on methods for estimating and updating large phylogenies

Paul Zaharias et al. Philos Trans R Soc Lond B Biol Sci.

Abstract

With the increased availability of sequence data and even of fully sequenced and assembled genomes, phylogeny estimation of very large trees (even of hundreds of thousands of sequences) is now a goal for some biologists. Yet, the construction of these phylogenies is a complex pipeline presenting analytical and computational challenges, especially when the number of sequences is very large. In the past few years, new methods have been developed that aim to enable highly accurate phylogeny estimations on these large datasets, including divide-and-conquer techniques for multiple sequence alignment and/or tree estimation, methods that can estimate species trees from multi-locus datasets while addressing heterogeneity due to biological processes (e.g. incomplete lineage sorting and gene duplication and loss), and methods to add sequences into large gene trees or species trees. Here we present some of these recent advances and discuss opportunities for future improvements. This article is part of a discussion meeting issue 'Genomic population structures of microbial pathogens'.

Keywords: maximum likelihood; multiple sequence alignment; phylogenetic placement; phylogenomics; phylogeny estimation; taxon identification.

Figures

Figure 1.
Average alignment error on 19 datasets with 10 099–93 681 sequences. The datasets are from the Homfam [24] collection of benchmark protein datasets with alignments defined by secondary and tertiary protein structures. Alignment error is based on pairwise homology statements for each alignment, where two letters that are in the same column of an alignment are considered homologous according to that alignment. The fraction of the pairwise homologies in the reference alignment that are not in the estimated alignment is the sum-of-pairs false negative (SPFN) error rate, and the fraction of the pairwise homologies in the estimated alignment that are not in the reference alignment is the sum-of-pairs false positive (SPFP) error rate. Results are averaged over the datasets on which all methods completed (Muscle segfaulted on two). Error bars show standard error. Reproduced from Smirnov [18] under the Creative Commons Attribution License.
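The SPFN and SPFP definitions in this caption can be illustrated with a minimal Python sketch. The function names (`homology_pairs`, `sp_error`) and the dict-of-gapped-rows input format are assumptions made for illustration; they are not taken from the article or from any alignment tool, and the quadratic pair enumeration is only practical for toy inputs, not for datasets of the size shown in the figure.

```python
from itertools import combinations


def homology_pairs(alignment):
    """Return the set of pairwise homologies implied by an alignment.

    alignment: dict mapping sequence name -> gapped row ("-" marks a gap);
    all rows must have the same length. A homology is recorded as
    ((name_a, i), (name_b, j)): letter i of sequence a shares a column
    with letter j of sequence b.
    """
    names = sorted(alignment)
    length = len(alignment[names[0]])
    assert all(len(alignment[n]) == length for n in names)
    next_index = {n: 0 for n in names}  # position in the ungapped sequence
    pairs = set()
    for col in range(length):
        present = []
        for n in names:
            if alignment[n][col] != "-":
                present.append((n, next_index[n]))
                next_index[n] += 1
        # every two letters in the same column form one homology statement
        pairs.update(combinations(present, 2))
    return pairs


def sp_error(reference, estimated):
    """Return (SPFN, SPFP) for an estimated alignment against a reference."""
    ref = homology_pairs(reference)
    est = homology_pairs(estimated)
    spfn = len(ref - est) / len(ref) if ref else 0.0
    spfp = len(est - ref) / len(est) if est else 0.0
    return spfn, spfp


# Toy example (hypothetical data): the estimate recovers both reference
# homologies but adds one homology that the reference does not contain.
reference = {"a": "AC-G", "b": "A-TG"}
estimated = {"a": "ACG", "b": "ATG"}
spfn, spfp = sp_error(reference, estimated)
```

In this toy case SPFN is 0 and SPFP is 1/3, because one of the three homologies in the estimated alignment is absent from the reference.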
Figure 2.
DTM pipeline for constructing a tree from an input sequence alignment using ML. (1) A starting tree is computed (e.g. using FastTree 2 or IQ-TREE 2 [35]). (2) Edges are deleted from the starting tree to produce small subsets. (3) Trees are estimated on the subsets using a selected ML method (e.g. IQ-TREE 2 or RAxML-NG). (4) The selected 'disjoint tree mergers' (DTM) method merges the disjoint trees into a tree on the full dataset. DTM pipelines that operate from multi-locus inputs and compute species trees have also been developed, with suitable adjustments to the algorithmic steps. Reproduced from Park et al. [31] under the Creative Commons Attribution License.
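Step (2) of this pipeline can be sketched in Python as follows. The caption does not say how the deleted edges are chosen, so this sketch assumes one common rule: repeatedly delete the edge that splits the remaining sequences most evenly, until every subset has at most `max_size` sequences. The function names, the adjacency-dict tree format, and the `max_size` parameter are hypothetical, and the brute-force edge search is written for clarity rather than speed.

```python
def component(adj, start, blocked):
    """Nodes reachable from `start` without crossing the edge `blocked`."""
    seen = {start}
    stack = [start]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if {v, w} == blocked or w in seen:
                continue
            seen.add(w)
            stack.append(w)
    return seen


def decompose(adj, taxa, max_size):
    """Split `taxa` (leaf names) into subsets of at most `max_size`.

    adj: dict mapping node name -> list of neighbouring node names
    (an unrooted tree with string node names). Assumed rule: delete the
    edge whose removal divides the taxa most evenly, then recurse.
    """
    if len(taxa) <= max_size:
        return [set(taxa)]
    best = None
    for v in adj:
        for w in adj[v]:
            if v < w:  # visit each undirected edge once
                side = component(adj, v, {v, w})
                imbalance = abs(2 * len(side & taxa) - len(taxa))
                if best is None or imbalance < best[0]:
                    best = (imbalance, side)
    side = best[1]
    # build the two subtrees left after deleting the chosen edge
    left = {v: [w for w in adj[v] if w in side] for v in side}
    right = {v: [w for w in adj[v] if w not in side]
             for v in adj if v not in side}
    return (decompose(left, taxa & side, max_size)
            + decompose(right, taxa - side, max_size))


# Toy starting tree (hypothetical): leaves a, b hang off x; c, d hang off y.
adj = {"a": ["x"], "b": ["x"], "c": ["y"], "d": ["y"],
       "x": ["a", "b", "y"], "y": ["c", "d", "x"]}
subsets = decompose(adj, {"a", "b", "c", "d"}, max_size=2)
```

Here the edge between x and y is deleted, giving the subsets {a, b} and {c, d}. Steps (3) and (4) would then run the chosen ML method on each subset and merge the resulting trees with a DTM method.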
Figure 3.
Comparison of standard ML methods (RAxML-NG, IQ-TREE 2 and FastTree 2) to a divide-and-conquer pipeline using the guide tree merger (GTM) on four simulated datasets with 1000–50 000 sequences. The 1000M1-HF datasets each have 1000 sequences that evolved under a GTRGAMMA+indel model and include fragmentary sequences, the Cox1-HET datasets each have 2341 sequences that evolved with heterotachy, and the RNASim [16] datasets have 10 000–50 000 sequences each and evolved under selective pressures to maintain the RNA secondary structure. Top: running time (hours); bottom: missing branch (FN) error rates across 10 replicates per model condition. Results are not shown for IQ-TREE 2 and RAxML-NG on the RNASim 50K dataset because IQ-TREE 2 failed to return a tree within the allowed time (24 h for the two smaller datasets and 168 h for the two larger datasets) and RAxML-NG produced trees with at least 99.96% FN error. Adapted from [31] under the Creative Commons Attribution License. (Online version in colour.)
Figure 4.
Species tree error (Robinson–Foulds (RF) error rates), wall clock running time (s) and peak memory usage of ASTRAL-Pro, ASTRID-DISCO and SpeciesRax on simulated datasets (evolved under GDL and ILS) of 1001 species and 50 estimated gene trees. All estimated and model trees are fully resolved, so the RF error rate is the fraction of bipartitions defined by internal edges of the model tree that are not in the estimated tree. Reproduced from [75] under the Creative Commons Attribution Non-Commercial License. (Online version in colour.)
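The RF error rate as defined in this caption (the fraction of model-tree bipartitions missing from the estimated tree) can be sketched in Python as below. The nested-tuple tree format and the function names are assumptions for illustration only; established phylogenetics libraries provide their own tree comparison routines, which are not reproduced here.

```python
def bipartitions(tree):
    """Return the non-trivial bipartitions of a tree given as nested tuples.

    Leaves are strings; internal nodes are tuples of subtrees. Each
    bipartition is stored as the frozenset of leaves on the side that does
    NOT contain a fixed anchor leaf, so the two trees are compared as
    unrooted trees regardless of where the tuples are rooted.
    """
    def leaf_set(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaf_set(c) for c in node))

    all_taxa = leaf_set(tree)
    anchor = min(all_taxa)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(c) for c in node))
        side = all_taxa - clade if anchor in clade else clade
        # keep only splits defined by internal edges (both sides >= 2 leaves)
        if 1 < len(side) < len(all_taxa) - 1:
            splits.add(side)
        return clade

    visit(tree)
    return splits


def rf_error_rate(model, estimated):
    """Fraction of model-tree bipartitions absent from the estimated tree."""
    m = bipartitions(model)
    e = bipartitions(estimated)
    return len(m - e) / len(m) if m else 0.0


# Toy example (hypothetical trees on five species).
model = ((("a", "b"), "c"), ("d", "e"))
estimated = ((("a", "c"), "b"), ("d", "e"))
error = rf_error_rate(model, estimated)
```

The model tree has two internal edges, ab|cde and abc|de. The estimated tree recovers only abc|de, so the error rate is 0.5. When both trees are fully resolved, as in the figure, this value equals the missing branch (FN) rate.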
