Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2024 May;42(5):768-777.
doi: 10.1038/s41587-023-01868-8. Epub 2023 Jul 27.

Generation of accurate, expandable phylogenomic trees with uDance

Affiliations

Generation of accurate, expandable phylogenomic trees with uDance

Metin Balaban et al. Nat Biotechnol. 2024 May.

Erratum in

Abstract

Phylogenetic trees provide a framework for organizing evolutionary histories across the tree of life and aid downstream comparative analyses such as metagenomic identification. Methods that rely on single-marker genes such as 16S rRNA have produced trees of limited accuracy with hundreds of thousands of organisms, whereas methods that use genome-wide data are not scalable to large numbers of genomes. We introduce updating trees using divide-and-conquer (uDance), a method that enables updatable genome-wide inference using a divide-and-conquer strategy that refines different parts of the tree independently and can build off of existing trees, with high accuracy and scalability. With uDance, we infer a species tree of roughly 200,000 genomes using 387 marker genes, totaling 42.5 billion amino acid residues.

PubMed Disclaimer

Conflict of interest statement

Competing Interests Statement

The authors declare no competing interests.

Figures

Figure Extended Figure 1:
Figure Extended Figure 1:
(A) nRF and (B) QD distance between estimated and true gene trees for all partition-gene pairs in all model conditions in the simulated dataset. RAxML-ng is used inside uDance (u) and on subsets whereas FT2 (A) is computed on the full dataset. The calculation of errors is always on the subsets to obtain fair comparisons.
Figure Extended Figure 2:
Figure Extended Figure 2:
Inserting a small number of queries on a large tree using uDance in auto and fast mode. We selected the 16,000 genomes from the analysis used in Figure 2g and updated it with {5×2i1i8,i} genomes using uDance in the auto (standard) and fast insertion mode (where the only difference is that partition sizes are set to 100). We measured the delta error for each query, which is defined as the change in RF distance between the true tree and inferred tree after placement of the query sequence. We show the mean delta error versus CPU time for various query set sizes. The running time grows slower with the −fast mode with out a significant sacrifice in accuracy. Whether the fast or the default modes is used, the accuracy is substantially higher when we allow the backbone tree to change. In fact, the accuracy improves after addition if the update mode is used whereas the accuracy stays the same or degrades with the incremental model or simple placement.
Figure Extended Figure 3:
Figure Extended Figure 3:
A) The 10K ASTRAL tree decorated with GTDB taxonomy. B) All by all comparison between the 10k, 16k, 200k, and GTDB trees on NCBI defined phyla and super-phyla.
Figure Extended Figure 4:
Figure Extended Figure 4:
ECDF of depth and the branch length of agreeing and disagreeing branches between the backbone and output phylogenies.
Figure Extended Figure 5:
Figure Extended Figure 5:
The paraphyletic Myxococcota phylum on the GTDB phylogenetic tree. Green and red sequences represent the members of the phylum that are proximal Desulfobacteria and Proteobacteria respectively in the 200k tree.
Figure Extended Figure 6:
Figure Extended Figure 6:
A) ECDF of branch support (local posterior probability) across partitions of the 16k tree. B) Branch support (local posterior probability) versus the diversity (average branch length) of all 78 partitions in the 200k tree. The dot and the range indicates median and 0.25–0.75 quantiles. Three colors correspond to clusters that are unusually small, unusually large, or the expected size. 14 of 15 partitions with the lowest diversity are of size between 2,500 and 6,000. The largest partitions in the 200k tree are over-represented parts of the tree of life in the reference genomic library that did not break into smaller partitions by uDance because of their lower diversity. C) Number of uncollapsed branches vs branch support (localPP) collapsing threshold. The comparison of number of uncollapsed branches vs branch support collapsing threshold for 16k and 200k tree.
Figure Extended Figure 7:
Figure Extended Figure 7:
Divide-and-conquer approach permits heterogeneity of model of evolution parameters across the tree. In discrete four-category LG+GAMMA model, the rate heterogeneity across sites is modeled by a discrete approximation of the Gamma distribution. The first and the fourth discrete rate of LG+G model for every partition and gene pair in the 16k tree are shown in (a) and (b), respectively. The partition hierarchy created by uDance is shown on the left. A blank cell indicates a missing gene in a partition.
Figure Extended Figure 8:
Figure Extended Figure 8:
A) Outgroup taxa selection strategy. Two to three taxa are chosen from the partition c2 (blue) to be added to the partition c1 (orange). B) Finding junction node v in Constrained ASTRAL tree for color (partition) c. We illustrate the setup for Claim 2 and 3. C) Stitching happens at junction nodes. After removing taxa placed on outgroup branches, other subtrees can be stitched to this subtree without any need for conceptual merge, but simply replacing the connecting nodes.
Figure Extended Figure 9:
Figure Extended Figure 9:
A) Determining the backbone size in the simulated HD-100 dataset. ECDF of novelty of query sequences with respect to a backbone tree of N downsampled sequences induced from the full HGT dataset. With N = 1000 sequences selected using TreeCluster-max, for more than 95% of the query sequences, the novelty score is less than one. Novelty score is defined as two times the terminal branch length of the query when placed on the true location on the backbone tree. B) The distribution of number of marker genes per sequence in WoL2 dataset. C) Two dot plots comparing (1) contamination ratio-vs-CSS and (2) contamination ratio-vs-GUNC database identity for the species in the 16K tree that are ”chimeric” (CSS > 0.45). We colored each point based on whether the sequence passed QC in GTDB or not. Triangle points are the published WoL tree, and round points are the new 6K taxa we added in the 16K tree. In these figures, We annotated 17 taxa in the 16K tree that might be reducing the accuracy of uDance and APPLES-2 in large clusters (subtrees) that include some of the densely sampled species such as Salmonella, E. coli, TB, etc. The pattern is clear that these contaminated genomes can be characterized by a large contamination ratio, near 100% CSS, and high database identity. We do not remove high CSS taxa if their contamination percentage is low, since uDance performs whole-genomebased placement, and it’s tolerant to low levels of contamination. Removing taxa satisfying both CSS ≥ 0.5 and Contamination ratio ≥ 0.25 removes 195 taxa from the 16K tree. 171 of them (87%) fail QC in GTDB. Of 195, 37 taxa are also present in WoL tree. 29 of these 37 don’t pass GTDB QC. D) Determining “best” marker genes to be used with APPLES-2 in order to improve placement speed. We picked a local maxima of average Archaea occupancy at 68th marker gene, which also ensures that, on average, Archaea sequences have at least 20 marker genes. The set of Archaea used in computation of these two statistics are taken from WoL tree. The name of the genes (shown on x-axis) are not important and can be ignored.
Figure 1:
Figure 1:
uDance overview. a) Updates to the phylogenetic tree through time (T0, T1, and T2) with new sequences arriving and each tree used as the backbone tree in the next step. Some sequences may be unplaceable and are added as query to the next iteration. b) Each update involves divide-and-conquer and several steps, most of which can be executed in a distributed fashion. The new sequences are independently added to the tree using phylogenetic placement. Then, the tree with placements is divided into partitions, and each partition plus enough outgroups from other partitions are reanalyzed to infer local gene trees and a local species tree. These species trees are next joined together using a constrained search that makes it possible to stitch back together the subtrees.
Figure 2:
Figure 2:
Results on simulation data set: three model condition with low (LD), mid (MD), and high (HD) discordance with 100 and 500 genes. HD-P1 to HD-P5 are 100 gene subsets of HD-500 with successively higher levels of gene tree discordance. (a) normalized Robinson-Foulds (nRF) and Quartet distance (QD) between inferred and true species tree. We show all points, mean (long dash) and the first and third quartiles (whiskers) over n = 10 independent replicates. The number above x-axis indicates the number of replicates in which each method failed to return a tree in 2 days, given 125GB of memory and reduces n. (b) The timeline of CPU usage on replicate 7 in HD-100 data set. Maximum number of cores made available for uDance is 672. (c) Cumulative CPU time (bars) and peak memory (dots and the horizontal bar) used by each method in n = replicates where concatenation method completed on HD-500 and FT2+ASTRAL method completed on HD-100 data set (mean shown as dashed line). FT2+ASTRAL failed in HD-500 for all replicates. For uDance, running time calculations include the time spent on backbone estimation. (d) The result of serial (online) tree inference on simulated data, showing nRF and QD Error, (e) wall-clock time, CPU time, and memory use versus output tree size. Staring from 250-sequence de novo backbone, every succeeding tree uses the preceding uDance output tree as the backbone. Missing points indicate failure to finish in the allotted 48-hour wall-time (FT2 was allowed 49 hours in one case). CPU time is aggregated cores across all nodes. In log-log plots, the slope of the line is shown for each method. Experiments are performed on a HPC with 45 exclusive nodes, each with Intel(R) Xeon(R) 16 cores E5–2670 2.60GHz CPU and 128 GB RAM.
Figure 3:
Figure 3:
New trees of microbial life. (a) Increased taxon sampling with respect to the 10k (Web of Life) tree. (b,c) 16k and 200k trees built using uDance. The tree is decorated with GTDB taxonomy v207 (with _A, _B, etc. suffixes removed) and collapses at phylum level for small phyla and class level for large phyla. Number of genomes in each group is shown, and taxon-holder names with less than 50 genomes were left unlabelled unless they disrupt monophyly of another phylum. Paraphyletic clades in our tree are given a numerical index. (d) Consistency of the large phyla and super-phyla in NCBI taxonomy database with various microbial phylogenies. In (a) and (d), we only show groups with at least 20 members in the 16K tree.
Figure 4:
Figure 4:
(a) Quartet score between the 16k species tree and n = 387 independent gene trees for categorized by the gene function as well as “All” genes (b) Running time of individual steps. ECDF of (c) depth (root-to-node distance) and (d) the branch length, separated for branches agreeing and disagreeing between the backbone 10k and the output 16k trees. (e) The portion of branches with support above a threshold (x-axis) in uDance trees. (f) Quartet distance between the 200k tree and other trees, restricted to NCBI phyla and super-phyla. (g) Gene tree discordance for every partition (rows) and gene trees (columns) in the 16k tree with the partition spanning tree shown on the left. (h) Heterogeneity of the best fit model of evolution across partitions of the tree in 1,000 plant (1KP) dataset. Boxplots show median (centre), interquartile range; IQR (bounds of box and 1.5× IQR (whiskers).

References

    1. Gonzalez A. et al. Qiita: rapid, web-enabled microbiome meta-analysis. Nature Methods 15, 796–798 (2018). URL https://www.nature.com/articles/s41592-018-0141-9. - PMC - PubMed
    1. Zhu Q. et al. Phylogeny-Aware Analysis of Metagenome Community Ecology Based on Matched Reference Genomes while Bypassing Taxonomy. mSystems 7, 1 (2022). URL https://journals.asm.org/doi/10.1128/msystems.00167-22. - DOI - PMC - PubMed
    1. Nayfach S, Shi ZJ, Seshadri R, Pollard KS & Kyrpides NC New insights from uncultivated genomes of the global human gut microbiome. Nature 568, 505–510 (2019). URL 10.1038/s41586-019-1058-x. - DOI - PMC - PubMed
    1. DeSantis TZ et al. Greengenes, a Chimera-Checked 16S rRNA Gene Database and Workbench Compatible with ARB. Appl. Environ. Microbiol 72, 5069–5072 (2006).URL http://aem.asm.org/cgi/content/abstract/72/7/5069http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1489311/. - PMC - PubMed
    1. Quast C. et al. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Research 41, D590–D596 (2012). URL http://academic.oup.com/nar/article/41/D1/D590/1069277/The-SILVA-ribosom.... - PMC - PubMed

Grants and funding