Genome Biol. 2018 Jun 27;19(1):82.
doi: 10.1186/s13059-018-1450-0.

HmmUFOtu: An HMM and phylogenetic placement based ultra-fast taxonomic assignment and OTU picking tool for microbiome amplicon sequencing studies

Qi Zheng et al. Genome Biol. 2018.

Abstract

Culture-independent analysis of microbial communities frequently relies on amplification and sequencing of the prokaryotic 16S ribosomal RNA gene. Typical analysis pipelines group sequences into operational taxonomic units (OTUs) to infer taxonomic and phylogenetic relationships. Here, we present HmmUFOtu, a novel tool for processing microbiome amplicon sequencing data, which performs rapid per-read phylogenetic placement, followed by phylogenetically informed clustering into OTUs and taxonomy assignment. Compared to standard pipelines, HmmUFOtu more accurately and reliably recapitulates microbial community diversity and composition in simulated and real datasets without relying on heuristics or sacrificing speed or accuracy.

Keywords: 16S rRNA gene; DNA substitution models; Dirichlet models; FM-index; HMM profile alignment; Microbiome; Operational taxonomic unit; Phylogenetic placement; Taxonomic assignment.

Conflict of interest statement

Authors’ information

Jacquelyn S. Meisel, PhD is now a Postdoctoral Associate at the Institute of Advanced Computer Studies (UMIACS) at the University of Maryland College Park. The new contact email for Dr. Meisel is meiselj@umiacs.umd.edu.

Ethics approval and consent to participate

None of the authors have any ethics or consent issues to declare for the experimental data used in this study. HmmUFOtu is licensed under the GNU General Public License v3.0.

Consent for publication

All authors have consented to the publication of this work in Genome Biology.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
General workflow of HmmUFOtu and the default OTU-based QIIME pipeline for 16S rRNA gene sequencing studies. a Main steps of the default QIIME pipeline include: (1) generating OTUs (OTU picking); (2) selecting an individual read for each OTU as the representative sequence (rep-seq picking); (3) assigning taxonomic information to every OTU by comparing the rep-seq to the reference database (taxonomic assignment); and optionally (4) aligning rep-seqs to the references; (5) constructing a de novo OTU tree using aligned rep-seqs. b Main steps of HmmUFOtu include: (1) per-read alignment and taxonomic assignment with profile-HMM and phylogenetic placement algorithms; (2) OTU picking around existing phylogenetic nodes to generate phylogeny-based OTUs, consensus based rep-seqs, and reference-based OTU tree. Dashed circles: phylogeny-based OTUs; gray dashed lines and dots: unneeded subtrees of the reference tree that are pruned to generate the OTU tree
Fig. 2
HmmUFOtu core algorithms. a Constructing a consensus sequence FM-index (CSFM-index) from a MSA using the Burrows-Wheeler transform (BWT) coupled with Wavelet-tree compression algorithms. Red: Actual stored data in a CSFM-index. b A “plan 7” (p7) HMM architecture specifically designed for 16S rRNA gene and other target gene/marker sequencing, with M (match), I (insertion), D (deletion), N (N′: 5′), C (C′: 3′), B (begin), and E (end) states, respectively. Dashed circles and arrows: “wing-retraction” process used to avoid empty alignment paths; red arrows: special transitions used to control the “global” or “local” alignment mode. c Banded-HMM Viterbi algorithm to find the most likely (minimum cost) path given the HMM profile (row), a read sequence (column), and two known “seed” paths by querying the CSFM-index. Only shaded grids are searched by the banded-Viterbi algorithm. The first and last shaded search areas rarely reach the profile ends. d An example of a 16S rRNA phylogenetic tree. In this tree, all directional conditional log-likelihoods (arrows in (e), (f), (g)) of all branches were pre-evaluated. The ancestral sequences of all internal nodes were inferred using maximum likelihood. e For a potential “seed” branch u--v, a small sub-tree containing only nodes u, v, the original conditional log-likelihoods L(u) and L(v) and original branch length w_0 are copied. f To place a new read n to branch u--v, a new internal node r is introduced, the new conditional log-likelihoods L(n) are evaluated, then initial branch lengths w_rv, w_ur, and w_nr are estimated using observed distance (p-Dist). g For a candidate top estimation, the branch lengths w_rv, w_ur, w_nr and L(rv), L(ru), and L(rn) are iteratively and jointly optimized until convergence
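The seed lookup in panels (a) and (c) rests on FM-index backward search over a Burrows-Wheeler transform. The following is a minimal sketch of that general technique on a single short sequence, not HmmUFOtu's CSFM-index: it assumes a plain string in place of an MSA consensus, uses uncompressed count tables instead of wavelet trees, and all function names are hypothetical.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (only suitable for short strings)."""
    s = text + "$"  # sentinel; '$' sorts before A, C, G, T in ASCII
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)


def build_fm(bwt_str):
    """Build the two FM-index tables from a BWT string (uncompressed, for illustration)."""
    alphabet = sorted(set(bwt_str))
    # C[c]: number of characters in the text that sort strictly before c
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += bwt_str.count(c)
    # occ[c][i]: number of occurrences of c in bwt_str[:i]
    occ = {c: [0] * (len(bwt_str) + 1) for c in alphabet}
    for i, ch in enumerate(bwt_str):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return C, occ


def count_occurrences(pattern, C, occ, n):
    """Backward search: count exact matches of pattern in the indexed text."""
    lo, hi = 0, n  # half-open range of suffix-array rows still matching
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


# Toy "consensus" sequence (made up for this example)
consensus = "ACGTACGTTACG"
transformed = bwt(consensus)
C, occ = build_fm(transformed)
print(count_occurrences("ACG", C, occ, len(transformed)))
```

A production index would also store sampled suffix-array positions so that each match can be mapped back to a profile column; that step is omitted here.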
Fig. 3
Precision-recall (PR) curves for HmmUFOtu taxonomic assignment results on four simulated datasets: (a) random; (b) V4; (c) V1–V3; (d) V3–V5. True positive (TP) and true negative (TN) are defined as both the known and assigned taxonomy having or not having a certain level of taxonomy annotation, respectively. PR curves are calculated by varying the assignment Q-score threshold from 0 to 10 with a step of 1, then 20, 30, 40, 50, 60, and 250 (the maximum value)
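The threshold sweep described in this caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' evaluation script: each read is a hypothetical tuple (known_has_level, assigned_has_level, q_score), an assignment whose Q-score falls below the threshold is treated as unannotated at that level, and precision is set to 1.0 by convention when nothing is called.

```python
# Q-score thresholds as listed in the caption: 0..10 in steps of 1, then 20..60, then 250
THRESHOLDS = list(range(0, 11)) + [20, 30, 40, 50, 60, 250]


def pr_points(reads, thresholds=THRESHOLDS):
    """Return (threshold, precision, recall) for each Q-score threshold.

    reads: iterable of (known_has_level, assigned_has_level, q_score).
    Assumption: an assignment counts only if q_score >= threshold.
    """
    points = []
    for t in thresholds:
        tp = fp = fn = 0
        for known, assigned, q in reads:
            called = assigned and q >= t
            if called and known:
                tp += 1          # both known and assigned carry the annotation
            elif called:
                fp += 1          # annotation assigned but absent from the truth
            elif known:
                fn += 1          # annotation in the truth but not (confidently) assigned
            # otherwise: true negative, which enters neither precision nor recall
        precision = tp / (tp + fp) if (tp + fp) else 1.0  # convention when no calls
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        points.append((t, precision, recall))
    return points


# Made-up reads for illustration only
reads = [(True, True, 30), (True, True, 5), (False, True, 2), (True, False, 0)]
for t, p, r in pr_points(reads):
    print(t, round(p, 3), round(r, 3))
```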
Fig. 4
Comparison of taxonomic assignment accuracy between HmmUFOtu and uclust, the QIIME-default OTU picking strategy, using the “gg_97_otus_GTR” database. The height of the bars reflects the assignment accuracy of four simulated datasets at different taxonomic levels using HmmUFOtu (a) or QIIME-default method (b, c). a, b Accuracy measured at per-read level. c Accuracy measured at per-OTU level
Fig. 5
Comparison of inferred bacterial community structures between HmmUFOtu and QIIME-default methods using V4 and V1–V3 mock community datasets. Both mock datasets contain ten replicates sequenced across ten different Illumina MiSeq runs. a, b Inferred and theoretical () mock community compositions for V4 and V1–V3 datasets, respectively, calculated using HmmUFOtu or QIIME-default generated OTU tables. Bars: replicate samples; assembled: pre-processed paired-end merged reads; paired: unprocessed paired-end reads. c Community structure dissimilarity between the inferred and reference community structure as calculated by the Bray-Curtis beta-diversity metric. The median is represented by the line in the box, hinges represent the first and third quartiles, whiskers represent 1.5 times the interquartile range, and dots represent outlying data points. d Alpha-diversity of the mock community measured by the inferred number of observed species. Box plots are as above. Left panels: V4 datasets; right panels: V1–V3 datasets
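For readers unfamiliar with the two metrics in panels (c) and (d), here is a minimal sketch of the standard Bray-Curtis dissimilarity and the observed-species count. The taxon names and counts are invented for illustration, and the functions are hypothetical helpers, not part of HmmUFOtu or QIIME.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two taxon -> abundance mappings.

    BC = sum_i |u_i - v_i| / sum_i (u_i + v_i); 0 means identical, 1 means no shared taxa.
    """
    taxa = set(u) | set(v)
    numerator = sum(abs(u.get(t, 0) - v.get(t, 0)) for t in taxa)
    denominator = sum(u.get(t, 0) + v.get(t, 0) for t in taxa)
    if denominator == 0:
        raise ValueError("both communities are empty")
    return numerator / denominator


def observed_species(counts):
    """Alpha diversity as the number of taxa with a non-zero count."""
    return sum(1 for c in counts.values() if c > 0)


# Made-up inferred and theoretical compositions
inferred = {"taxon_A": 50, "taxon_B": 30, "taxon_C": 20}
theoretical = {"taxon_A": 40, "taxon_B": 40, "taxon_D": 20}

print(bray_curtis(inferred, theoretical))
print(observed_species(inferred))
```

With these toy counts the absolute differences sum to 60 and the totals sum to 200, so the dissimilarity is 0.3; the inferred community has 3 observed taxa.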
Fig. 6
Comparing the quality of rep-seqs between HmmUFOtu’s consensus-based and QIIME’s single-sequence based rep-seq picking methods. For HmmUFOtu: the consensus sequences with (default) or without the priors were tested; for QIIME: the “first,” “longest,” and “random” methods were tested. a Mock datasets, in which the quality is reflected by the %identity between the rep-seqs and the known bacterial reference genomes. b HMP datasets, in which the quality is reflected by the %identity between the rep-seqs and the de novo assembled scaffolds from the WGS data sequenced in the same samples. LAH left auriculotemporal part of head, RAH right auriculotemporal part of head
Fig. 7
Benchmarking results from chimeric read detection using the “segment placement comparison” algorithm from HmmUFOtu. a Receiver operating characteristic (ROC) curves for detecting simulated chimeric reads from random in silico cross-over events using GreenGenes 97% OTU reference sequences; 10,000 chimeric or non-chimeric simulated reads were used in the respective range for each p-distance subset. ROC curves are calculated by varying the min LOD cut-off from 0 to 10 with a step of 1, then 20 to 100 with a step of 10. b Estimated proportion of chimeric reads in all of the benchmarked real datasets (mock and HMP) by enabling HmmUFOtu’s chimera detection and setting the LOD cut-off at 50. c Differences in “observed species” alpha diversity of the mock community V1–V3 dataset. Paired: original results using raw paired reads; Nonchimera: chimera-filtered results using the same paired reads
Fig. 8
Multi-threading performance of HmmUFOtu benchmarked on four simulated datasets on 1–16 threads. a Relative processing speed (reads per second) normalized to 1-thread results. b Average CPU usage (%). c Maximum RAM (memory) usage in GB. All results are based on the average of 20 replicate samples
