Taxonomic classification of bacterial 16S rRNA genes using short sequencing reads: evaluation of effective study designs

Orna Mizrahi-Man¹, Emily R Davenport, Yoav Gilad

Affiliations

PMID: 23308262
PMCID: PMC3538547
DOI: 10.1371/journal.pone.0053608

Taxonomic classification of bacterial 16S rRNA genes using short sequencing reads: evaluation of effective study designs

Orna Mizrahi-Man et al. PLoS One. 2013.

. 2013;8(1):e53608.

doi: 10.1371/journal.pone.0053608. Epub 2013 Jan 7.

Authors

Orna Mizrahi-Man¹, Emily R Davenport, Yoav Gilad

Affiliation

¹ Department of Human Genetics, University of Chicago, Chicago, Illinois, United States of America.

PMID: 23308262
PMCID: PMC3538547
DOI: 10.1371/journal.pone.0053608

Abstract

Massively parallel high throughput sequencing technologies allow us to interrogate the microbial composition of biological samples at unprecedented resolution. The typical approach is to perform high-throughout sequencing of 16S rRNA genes, which are then taxonomically classified based on similarity to known sequences in existing databases. Current technologies cause a predicament though, because although they enable deep coverage of samples, they are limited in the length of sequence they can produce. As a result, high-throughout studies of microbial communities often do not sequence the entire 16S rRNA gene. The challenge is to obtain reliable representation of bacterial communities through taxonomic classification of short 16S rRNA gene sequences. In this study we explored properties of different study designs and developed specific recommendations for effective use of short-read sequencing technologies for the purpose of interrogating bacterial communities, with a focus on classification using naïve Bayesian classifiers. To assess precision and coverage of each design, we used a collection of ∼8,500 manually curated 16S rRNA gene sequences from cultured bacteria and a set of over one million bacterial 16S rRNA gene sequences retrieved from environmental samples, respectively. We also tested different configurations of taxonomic classification approaches using short read sequencing data, and provide recommendations for optimal choice of the relevant parameters. We conclude that with a judicious selection of the sequenced region and the corresponding choice of a suitable training set for taxonomic classification, it is possible to explore bacterial communities at great depth using current technologies, with only a minimal loss of taxonomic resolution.

PubMed Disclaimer

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Figure 1. Performance of different training sets in the classification of 100 nt reads from the V4 amplicon.**
Each panel compares the performance of the training sets (described in Table 1) for a different rank. We used the results of leave-k-out tests classifying the LTP sequences to determine confidence score thresholds for a set of desired false prediction rate (FPR) values (x axis), so that the FPR would be at most the desired value. We then used these thresholds to calculate the classification coverage of sequences from environmental (uncultured) bacteria that corresponds to the desired FPR (y axis).

**Figure 2. Classification performance of different experimental designs.**
Each panel compares performance of different regions for a different combination of rank (genus or family) and sequencing strategy (100/120 nt single/paired-end reads). We used the results of leave-k-out tests classifying the LTP sequences to determine confidence score thresholds for a set of desired false prediction rate (FPR) values (x axis), so that the FPR would be at most the desired value (Tables S4, S5, S6, S7, S8, S9, and S10). We then used these thresholds to calculate the classification coverage of sequences from environmental (uncultured) bacteria that corresponds to the desired FPR (y axis). Figure S5 compares the performance of different regions across the same sequencing configurations for the ranks order, class, and phylum.

**Figure 3. Classification performance of combined 100 nt single-read predictions, as compared to the best performing paired-end configurations.**
We combined predictions made for different 100 nt fragments of the same sequence, by selecting the prediction with the highest confidence score at the genus level (or the lowest common level available). We evaluated the performance, at ranks genus and family (left and right panels, respectively), of combinations of fragments from the V3 and V4 regions (top and bottom panels, respectively) with fragments from each of the other regions examined, and compared it to the performance of the V3 and V4 100 nt paired-end configurations (pointed to by arrows). We used the results of leave-k-out tests classifying the LTP sequences to determine confidence score thresholds for a set of desired false prediction rate (FPR) values (x axis), so that the FPR would be at most the desired value. We then used these thresholds to calculate the classification coverage of sequences from environmental (uncultured) bacteria that corresponds to the desired FPR (y axis). Figure S6 compares the performance of the combinations for the ranks order, class, and phylum.

See this image and copyright information in PMC

References

1. Pace NR (1997) A molecular view of microbial diversity and the biosphere. Science 276: 734–740. - PubMed
1. Woese CR, Fox GE (1977) Phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proc Natl Acad Sci U S A 74: 5088–5090. - PMC - PubMed
1. Arumugam M, Raes J, Pelletier E, Le Paslier D, Yamada T, et al. (2011) Enterotypes of the human gut microbiome. Nature 473: 174–180. - PMC - PubMed
1. Costello EK, Lauber CL, Hamady M, Fierer N, Gordon JI, et al. (2009) Bacterial community variation in human body habitats across space and time. Science 326: 1694–1697. - PMC - PubMed
1. Turnbaugh PJ, Hamady M, Yatsunenko T, Cantarel BL, Duncan A, et al. (2009) A core gut microbiome in obese and lean twins. Nature 457: 480–484. - PMC - PubMed

Publication types

Actions

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions

Substances

Actions

Grants and funding

LinkOut - more resources

Full Text Sources
Other Literature Sources
- The Lens - Patent Citations Database
- scite Smart Citations
Molecular Biology Databases
- NIAID Data Ecosystem - Find datasets on Infectious and Immune-mediated Diseases

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Taxonomic classification of bacterial 16S rRNA genes using short sequencing reads: evaluation of effective study designs

Affiliation

Taxonomic classification of bacterial 16S rRNA genes using short sequencing reads: evaluation of effective study designs

Authors

Affiliation

Abstract

Conflict of interest statement

Figures

References

Publication types

MeSH terms

Substances

Grants and funding

LinkOut - more resources

Full Text Sources

Other Literature Sources

Molecular Biology Databases