Front Genet. 2019 Oct 17;10:1022.
doi: 10.3389/fgene.2019.01022. eCollection 2019.

Embracing Ambiguity in the Taxonomic Classification of Microbiome Sequencing Data

Nidhi Shah et al. Front Genet. 2019.

Abstract

The advent of high-throughput sequencing has enabled in-depth characterization of human and environmental microbiomes. Determining the taxonomic origin of microbial sequences is one of the first, and frequently the only, analyses performed on microbiome samples. Substantial research has focused on the development of methods for taxonomic annotation, often making trade-offs between computational efficiency and classification accuracy. A side effect of these efforts has been a reexamination of the bacterial taxonomy itself. Taxonomies developed prior to the genomic revolution captured complex relationships between organisms that went beyond uniform taxonomic levels such as species, genus, and family. Driven in part by the need to simplify computational workflows, the bacterial taxonomies used most commonly today have been regularized to fit within a standard set of seven taxonomic levels. Consequently, modern analyses of microbial communities are relatively coarse-grained. Few methods make classifications below the genus level, which limits our ability to capture biologically relevant signals. Here, we present ATLAS, a novel strategy for taxonomic annotation that uses significant outliers within database search results to group sequences in the database into partitions. These partitions capture the extent of taxonomic ambiguity within the classification of a sample. The ATLAS pipeline can be found on GitHub [https://github.com/shahnidhi/outlier_in_BLAST_hits]. We demonstrate that ATLAS provides annotations similar to those of phylogenetic placement methods, but with higher computational efficiency. When applied to human microbiome data, ATLAS is able to identify previously characterized taxonomic groupings, such as those in the class Clostridia and the genus Bacillus. Furthermore, the majority of partitions identified by ATLAS are at the subgenus level, replacing higher-level annotations with specific groups of species. These more precise partitions improve our detection power in determining differential abundance in microbiome association studies.
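
The grouping step described in the abstract can be illustrated with a minimal sketch. It is not the ATLAS implementation: the function names, the `min_weight` threshold, and the toy input are assumptions made for illustration, and plain connected components stand in for whatever clustering method the actual pipeline applies to its graph. The input is assumed to be, for each query, the set of reference sequences already flagged as outlier hits by the database search.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_graph(outlier_hits):
    """Count how often each pair of reference sequences shares an outlier set."""
    weights = defaultdict(int)
    for hits in outlier_hits.values():
        for a, b in combinations(sorted(set(hits)), 2):
            weights[(a, b)] += 1
    return weights


def partitions(outlier_hits, min_weight=1):
    """Group references joined by edges of weight >= min_weight.

    Connected components (via union-find) are a simple stand-in for a
    community detection step; min_weight is a hypothetical parameter.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    # Register every reference, including ones that never co-occur.
    for hits in outlier_hits.values():
        for ref in hits:
            find(ref)

    for (a, b), weight in cooccurrence_graph(outlier_hits).items():
        if weight >= min_weight:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for ref in list(parent):
        groups[find(ref)].add(ref)
    return sorted(sorted(group) for group in groups.values())


# Toy input: query ID -> reference sequences flagged as outlier hits.
example_hits = {
    "q1": ["refA", "refB"],
    "q2": ["refA", "refB"],
    "q3": ["refB", "refC"],
    "q4": ["refD"],
}

print(partitions(example_hits, min_weight=2))
# [['refA', 'refB'], ['refC'], ['refD']]
```

With `min_weight=1` the weaker refB/refC link is also kept, so refA, refB, and refC merge into a single partition. This shows how the threshold controls the amount of ambiguity each partition expresses.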

Keywords: 16S rRNA marker gene; classification; high-throughput sequencing; microbiome; taxonomy.

Figures

Figure 1
Schematic diagram of the ATLAS pipeline. ATLAS takes in query sequences from a marker gene and searches them against a reference database to identify outlier sequences. It then constructs a graph of database sequences and clusters those that are commonly identified together into partitions.
Figure 2
Schematic detailing when ATLAS will provide the greatest improvement to taxonomic annotation. Shown is a simple example of a phylogenetic tree with taxonomic information for the reference sequences, where the leaves are actual sequences in the database. When a query sequence (yellow stars) has near neighbors in the reference, as Q1 does, most algorithms will classify it correctly. However, if a sequence such as Q2 does not have many near neighbors in the database, computationally expensive phylogenetic methods are required for accurate placement (blue arrows) and annotation. ATLAS captures groups (or partitions) of database sequences (red nodes) that are commonly confused during the annotation process and assigns them to the query sequence (square node for Q1 and diamond nodes for Q2). Black triangles show collapsed portions of the tree. Although this schematic is simplified and real phylogenies are much more complex, it illustrates that ATLAS provides additional information when query sequences do not have near neighbors in the database. It represents the ideal case, in which the 16S rRNA phylogeny and the taxonomic annotations are congruent.
Figure 3
ATLAS generates classifications similar to those of phylogenetic placement methods at an improved speed. Taxonomic labels assigned by TIPP and ATLAS agree at all taxonomic levels for both the (A) GEMS and (B) HMP datasets. (C) The ATLAS pipeline adds minimal post-processing time (in seconds) to standard BLAST analyses and is significantly faster than TIPP.
Figure 4
ATLAS partitions for HMP and GEMS data typically capture subgenus-level information. For most partitions, the most recent common ancestor is at the genus level in both the (A) HMP and (B) GEMS datasets.
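
To make the "most recent common ancestor" of a partition concrete, the following sketch finds the deepest rank shared by every member's lineage. It assumes each reference sequence comes with a seven-level lineage ordered from kingdom to species. The `RANKS` list, the function name, and the two example lineages are illustrative assumptions and are not taken from the ATLAS code or its datasets.

```python
# Seven standard ranks, ordered from most general to most specific.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def mrca_rank(lineages):
    """Return (rank, name) for the deepest rank shared by all lineages.

    Returns None when the lineages are empty or disagree at the first rank.
    """
    shared = None
    for rank, names in zip(RANKS, zip(*lineages)):
        if len(set(names)) != 1:
            break  # lineages diverge here; keep the last shared rank
        shared = (rank, names[0])
    return shared


# Hypothetical partition holding two species of the same genus.
example_partition = [
    ["Bacteria", "Firmicutes", "Bacilli", "Bacillales",
     "Bacillaceae", "Bacillus", "Bacillus subtilis"],
    ["Bacteria", "Firmicutes", "Bacilli", "Bacillales",
     "Bacillaceae", "Bacillus", "Bacillus cereus"],
]

print(mrca_rank(example_partition))
# ('genus', 'Bacillus')
```

A partition like this one would be counted at the genus level in a tally such as the one in Figure 4, even though it names a specific group of species within that genus.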
