Nat Commun. 2013;4:2304.
doi: 10.1038/ncomms3304.

PhyloPhlAn is a new method for improved phylogenetic and taxonomic placement of microbes

Nicola Segata et al. Nat Commun. 2013.

Abstract

New microbial genomes are constantly being sequenced, and it is crucial to accurately determine their taxonomic identities and evolutionary relationships. Here we report PhyloPhlAn, a new method to assign microbial phylogeny and putative taxonomy using >400 proteins optimized from among 3,737 genomes. This method measures the sequence diversity of all clades, classifies genomes from deep-branching candidate divisions through closely related subspecies and improves consistency between phylogenetic and taxonomic groupings. PhyloPhlAn improved taxonomic accuracy for existing and newly sequenced genomes, detecting 157 erroneous labels, correcting 46 and placing or refining 130 new genomes. We provide examples of accurate classifications from subspecies (Sulfolobus spp.) to phyla, and of preliminary rooting of deep-branching candidate divisions, including consistent statistical support for Caldiserica (formerly candidate division OP5). PhyloPhlAn will thus be useful for both phylogenetic assessment and taxonomic quality control of newly sequenced genomes. The final phylogenies, conserved protein sequences and open-source implementation are available online.

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1. A high-resolution microbial tree of life with taxonomic annotations
We reconstruct and validate a bacterial and archaeal phylogeny leveraging subsequences from 400 broadly conserved proteins determined using 2,887 genomes and applied to a total of 3,737 genomes. The tree is built using RAxML, with organisms colored by phylum for phyla that include at least 5 genomes. Scale indicates normalized fraction of total branch length. Gray labels indicate the lowest common ancestor of genera with at least 10 genomes (excluding predicted taxonomic mislabelings). External bar length represents the fraction of the 400 proteins contained in each genome. Red external triangles indicate genomes predicted by our method to be taxonomically mislabeled and confidently replaced; blue triangles indicate problematic labels that were refined but still did not fall within a fully consistent clade; green triangles indicate genomes whose incomplete taxonomic label we confidently refined; and black triangles indicate 566 genomes from IMG-GEBA that have been newly placed into the tree.
Figure 2. Selecting informative subsequences improves the accuracy of phylogenetic tree reconstruction
As compared to a gold standard derived from the IMG taxonomy, both precision (A) and recall (B) of inferred phylogenies increase at all taxonomic levels as up to the 500 most-conserved proteins are sampled (values averaged across all clades at each level). Comparison with full-length protein sequence phylogenies (up to 100 proteins) confirms that alignments subsampled at the most discriminative amino acids are both more accurate and more efficient. This approach outperforms single 16S rRNA gene phylogenies at all taxonomic levels, as well as trees based on curated ribosomal protein concatenation, for all but the most specific clades. (C) The relative phylogenetic diversity of all taxonomic levels is consistent across varying protein numbers and is on average remarkably logarithmic, providing quantitative support for the existing multi-level microbial taxonomy. (D) Relative phylogenetic diversity among individual clades at each taxonomic level, however, shows a tremendous range of diversities, with some underrepresented phyla comprising only as much sequence divergence among available genomes as some species. This suggests that while taxonomic levels are consistent on average, clade-specific diversity thresholds should be employed when linking phylogenetic divergence with individual taxonomic labels. Again, even the most diverse species reconstructed by this method are better resolved than those using the 16S rRNA gene alone, for which many demonstrate improbably high putative phylogenetic diversity.
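The caption above does not spell out how precision and recall are computed for a taxonomic clade. As an illustration only, the sketch below assumes one common convention: each taxonomic group is matched to the subtree that maximizes F1, and precision and recall are read off that subtree. The nested-tuple tree encoding and the function names `leaf_sets` and `clade_precision_recall` are hypothetical and not taken from PhyloPhlAn.

```python
from typing import FrozenSet, Iterable, List, Tuple


def leaf_sets(tree) -> List[FrozenSet[str]]:
    """Return the leaf set of every node in a nested-tuple tree.

    A leaf is a string (genome name); an internal node is a tuple of children.
    """
    collected: List[FrozenSet[str]] = []

    def walk(node) -> FrozenSet[str]:
        if isinstance(node, str):
            leaves = frozenset([node])
        else:
            leaves = frozenset().union(*(walk(child) for child in node))
        collected.append(leaves)
        return leaves

    walk(tree)
    return collected


def clade_precision_recall(tree, members: Iterable[str]) -> Tuple[float, float]:
    """Score a taxonomic group against its best-matching subtree (highest F1).

    precision = fraction of the subtree's leaves that belong to the group
    recall    = fraction of the group's members found in the subtree
    """
    group = frozenset(members)
    best = (0.0, 0.0, 0.0)  # (f1, precision, recall)
    for leaves in leaf_sets(tree):
        true_pos = len(leaves & group)
        if true_pos == 0:
            continue
        precision = true_pos / len(leaves)
        recall = true_pos / len(group)
        f1 = 2 * precision * recall / (precision + recall)
        best = max(best, (f1, precision, recall))
    return best[1], best[2]


if __name__ == "__main__":
    # Toy tree: species B is split, with b1 nested next to species A.
    toy_tree = ((("a1", "a2"), "b1"), ("b2", "b3"))
    print(clade_precision_recall(toy_tree, {"a1", "a2"}))
    print(clade_precision_recall(toy_tree, {"b1", "b2", "b3"}))
```

Averaging these per-clade scores over all clades at a given taxonomic level would give one value per level, in the spirit of the per-level averages the caption describes.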
Figure 3. Inferred phylum, genus, and species phylogenetic trees
(A) The inferred Actinobacteria phylum subtree, with genomes colored by family and genera annotated by root node. All 19 families are grouped consistently, which cannot be achieved by 16S gene sequences alone. (B) The Corynebacterium genus subtree, with highly concordant species and strain grouping not achieved by previous analyses. (C) Archaeal genomes of genus Sulfolobus, and (D) for S. islandicus, an inset of the inferred strain-level tree. For this particular organism, all 9 genomes group consistently according to the geography of their site of origin.
Figure 4. Accuracy of correctly re-inferred taxonomic labels for artificially-mislabeled organisms
Bar plots report the percentages (with s.d.) of successfully recovered cases. (A) For 5 iterations, 10 taxa are selected at random from species with 2, more than 2, or more than 5 genomes, and their species-level label removed. The PhyloPhlAn phylogenetic tree (which is built without any taxonomic information) is then used to re-impute the removed labels at medium, high, and very high confidence thresholds. No incorrect refinements are produced at the highest confidence threshold, and average recall rates for species with at least three taxa exceed 90% at high confidence. (B) We repeat this procedure by mislabeling (rather than removing labels for) species, genus, or family-level assignments. No false positives are produced at high or very high confidence, and only 2 occur across all experiments (<1%).
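The re-imputation experiment in Figure 4 can be pictured with a small sketch. The caption does not give the rule PhyloPhlAn uses, so the code below swaps in a simple neighbor vote as an assumption: walk up from the unlabeled genome to the smallest enclosing clade with enough labeled neighbors, and accept the majority label only if agreement meets a threshold. The parameters `min_support` and `min_agreement`, and the function names, are hypothetical stand-ins for the paper's confidence levels.

```python
from collections import Counter
from typing import Dict, List, Optional


def leaves_of(node) -> List[str]:
    """List the leaf names under a nested-tuple tree node."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves_of(child)]


def path_to(node, target: str) -> Optional[list]:
    """Return the chain of nodes from `node` down to the leaf `target`, or None."""
    if isinstance(node, str):
        return [node] if node == target else None
    for child in node:
        sub_path = path_to(child, target)
        if sub_path is not None:
            return [node] + sub_path
    return None


def impute_label(
    tree,
    labels: Dict[str, str],
    target: str,
    min_support: int = 2,        # hypothetical: labeled neighbors required
    min_agreement: float = 1.0,  # hypothetical: fraction that must agree
) -> Optional[str]:
    """Propose a label for `target` from its closest labeled relatives.

    Returns None when the smallest sufficiently populated clade is not
    consistent enough, i.e. the call is withheld rather than guessed.
    """
    path = path_to(tree, target)
    if path is None:
        raise KeyError(target)
    # Visit enclosing clades from the smallest to the whole tree.
    for clade in reversed(path[:-1]):
        votes = Counter(
            labels[leaf]
            for leaf in leaves_of(clade)
            if leaf != target and leaf in labels
        )
        total = sum(votes.values())
        if total >= min_support:
            label, count = votes.most_common(1)[0]
            return label if count / total >= min_agreement else None
    return None


if __name__ == "__main__":
    toy_tree = ((("x", "s1"), "s2"), ("t1", "t2"))
    known = {"s1": "S", "s2": "S", "t1": "T", "t2": "T"}
    print(impute_label(toy_tree, known, "x"))                  # uses the (x, s1, s2) clade
    print(impute_label(toy_tree, known, "x", min_support=3))   # falls back to the root
```

Raising `min_support` or `min_agreement` trades recall for fewer wrong calls, which mirrors the caption's pattern of fewer false positives at higher confidence thresholds.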
