PLoS One. 2012;7(3):e32491.

doi: 10.1371/journal.pone.0032491. Epub 2012 Mar 5.

Using the RDP classifier to predict taxonomic novelty and reduce the search space for finding novel organisms

Yemin Lan¹, Qiong Wang, James R Cole, Gail L Rosen

PMID: 22403664
PMCID: PMC3293824
DOI: 10.1371/journal.pone.0032491

Yemin Lan et al. PLoS One. 2012.

Affiliation

¹ School of Biomedical Engineering, Science, and Health Systems, Drexel University, Philadelphia, Pennsylvania, United States of America.

Abstract

Background: Currently, the naïve Bayesian classifier provided by the Ribosomal Database Project (RDP) is one of the most widely used tools to classify 16S rRNA sequences, mainly collected from environmental samples. We show that RDP has 97+% assignment accuracy and is fast for 250 bp and longer reads when the read originates from a taxon known to the database. Because most environmental samples will contain organisms from taxa whose 16S rRNA genes have not been previously sequenced, we aim to benchmark how well the RDP classifier and other competing methods can discriminate these novel taxa from known taxa.

Principal findings: Because each fragment is assigned a score (containing likelihood or confidence information such as the bootstrap score in the RDP classifier), we "train" a threshold to discriminate between novel and known organisms and observe its performance on a test set. The threshold that we determine tends to be conservative (low sensitivity but high specificity) for naïve Bayesian methods. Nonetheless, our method performs better with the RDP classifier than the other methods tested, measured by the f-measure and the area-under-the-curve on the receiver operating characteristic of the test set. By constraining the database to well-represented genera, sensitivity improves by 3-15%. Finally, we show that the detector is a good predictor to determine novel abundant taxa (especially for finer levels of taxonomy where novelty is more likely to be present).
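The threshold "training" described above can be illustrated with a minimal sketch. It assumes that a read is called novel when its score falls below the threshold, and that the f-measure is the usual harmonic mean of precision and recall; the paper's exact definition may differ. All function names and the toy scores are hypothetical and are not taken from the study.

```python
from typing import Sequence, Tuple


def confusion(scores: Sequence[float], is_novel: Sequence[bool],
              threshold: float) -> Tuple[int, int, int, int]:
    """Count (tp, fp, tn, fn), calling a read novel if its score < threshold."""
    tp = fp = tn = fn = 0
    for score, novel in zip(scores, is_novel):
        predicted_novel = score < threshold
        if predicted_novel and novel:
            tp += 1
        elif predicted_novel and not novel:
            fp += 1
        elif novel:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def f_measure(tp: int, fp: int, fn: int) -> float:
    """F1 score (assumed definition): 2*tp / (2*tp + fp + fn)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def train_threshold(scores: Sequence[float], is_novel: Sequence[bool]) -> float:
    """Pick the candidate threshold with the best f-measure on training data."""
    # Candidates: every observed score, plus one value above the maximum
    # so that "call everything novel" is also considered.
    candidates = sorted(set(scores)) + [max(scores) + 1]
    best_t, best_f = candidates[0], -1.0
    for t in candidates:
        tp, fp, _tn, fn = confusion(scores, is_novel, t)
        f = f_measure(tp, fp, fn)
        if f > best_f:  # ties keep the lower (more conservative) threshold
            best_t, best_f = t, f
    return best_t


def evaluate(scores: Sequence[float], is_novel: Sequence[bool],
             threshold: float) -> Tuple[float, float, float]:
    """Return (sensitivity, specificity, f-measure) at a fixed threshold."""
    tp, fp, tn, fn = confusion(scores, is_novel, threshold)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity, f_measure(tp, fp, fn)


def auc(scores: Sequence[float], is_novel: Sequence[bool]) -> float:
    """Area under the ROC curve via the rank (Mann-Whitney) formulation:
    the chance that a random novel read scores below a random known read."""
    novel = [s for s, n in zip(scores, is_novel) if n]
    known = [s for s, n in zip(scores, is_novel) if not n]
    if not novel or not known:
        raise ValueError("need at least one novel and one known read")
    wins = 0.0
    for a in novel:
        for b in known:
            if a < b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(novel) * len(known))


# Made-up bootstrap-like scores (0-100) purely for illustration.
train_scores = [95, 90, 88, 70, 40, 35, 60, 20]
train_novel = [False, False, False, False, True, True, False, True]
test_scores = [92, 55, 30, 65, 45]
test_novel = [False, False, True, False, True]

threshold = train_threshold(train_scores, train_novel)
print("threshold:", threshold)
print("test (sens, spec, f):", evaluate(test_scores, test_novel, threshold))
print("test AUC:", auc(test_scores, test_novel))
```

The threshold is fixed on the training half and then applied unchanged to the test half, so the operating point it produces need not be the best point on the test ROC curve.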

Conclusions: We conclude that selecting a read-length-appropriate RDP bootstrap score can significantly reduce the search space for identifying novel genera and higher levels in taxonomy. In addition, having a well-represented database significantly improves performance, while having genera that are "highly" similar does not make a significant improvement. On a real dataset from an Amazon Terra Preta soil sample, we show that the detector can predict (or correlates to) whether novel sequences will be assigned to new taxa when the RDP database "doubles" in the future.
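The correlation mentioned in the conclusions can be sketched as a Pearson coefficient between two per-taxon quantities: the fraction of reads the detector flags as novel with the smaller database, and the later change in that taxon's abundance with the larger database. This is only an illustration under those assumptions; the variable names and the numbers are hypothetical and do not come from the Amazon soil dataset.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-taxon values, one entry per taxon bin:
# fraction of reads predicted novel with the "present" database ...
predicted_novel_fraction = [0.1, 0.4, 0.7, 0.9]
# ... and percent decrease of that bin once the "future" database is used.
percent_decrease = [5.0, 20.0, 30.0, 50.0]

r = pearson(predicted_novel_fraction, percent_decrease)
print(f"Pearson r = {r:.3f}")
```

A coefficient near 1 would mean that bins the detector flags as mostly novel are the same bins that lose reads to newly added taxa.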

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Figure 1. The number of sequences per genera (log-scale) demonstrating the imbalance of the database.**

**Figure 2. Setup for the “half-fold” experiments where half the sequences were used for training and half for testing.**

**Figure 3. Illustrating how Figure 2 relates to the overall detector development and testing for each method.**

**Figure 4. The ROC curve for 4 different novel/known detection methods using the 500 bp read test dataset at the genus-level.**
The naïve Bayesian methods perform better (higher AUC) than Phymm(BL). The threshold chosen from the training data (by f-measure) is shown with a blue dot.

**Figure 5. The ROC curve for 4 different methods for 250 bp reads on the genus-level.**
RDP obtains the best AUC, followed by NBC, PhymmBL, and Phymm. The blue star indicates the threshold determined from the training data. In this case, unlike for the other methods, the threshold determined from the training data for PhymmBL was also the optimal point for the test set. This explains PhymmBL's good performance in Fig. 8 (bar graph for 250 bp).

**Figure 6. The ROC curve for 4 different methods for 100 bp reads on the genus-level.**
RDP obtains the best AUC, followed by NBC, PhymmBL, and Phymm. The blue star indicates the threshold determined from the training data. In this case, unlike for the other methods, the threshold determined from the training data for NBC was also the optimal point for the test set. This explains NBC's good performance in Fig. 9 (bar graph for 100 bp).

**Figure 7. The sensitivity, specificity, and f-measure comparison of novel/known detection of the 500 bp read test dataset on the genus-level.**
RDP's bootstrap performs the best for being able to discriminate between reads from known and novel origin, with around 76% for the combined f-measure.

**Figure 8. The sensitivity, specificity, and f-measure comparison of 250 bp reads on the genus-level.**
The naïve Bayesian methods and the hybrid PhymmBL (with empirically chosen threshold) have the best f-measure, while Phymm and PhymmBL's built-in confidence measures do not perform as well. PhylOTU discarded 789 of 16804 reads; only the reads it classified are counted in our performance metric.

**Figure 9. The sensitivity, specificity, and f-measure comparison of 100 bp reads on the genus-level.**
The naïve Bayesian methods and the hybrid PhymmBL (with empirically chosen threshold) have the best f-measure, while Phymm and PhymmBL's built-in confidence measures do not perform well overall. PhylOTU had memory errors when placing the ∼6400 reads in the test dataset, so no performance metric is reported for it here.

**Figure 10. The Receiver-Operating Characteristic Curves for RDP on various taxonomic ranks for 500 bp reads.**
The phylum level has almost perfect performance (maximal TPR at minimal FPR). Surprisingly, family and order have slightly lower AUC than genus, but this is most likely due to taxonomic anomalies at these levels. The performance on the test data, using the threshold derived from the training data, is shown with a blue star.

**Figure 11. The ROC curve for RDP on all levels for 250 bp reads.**
Again, the phylum level has high sensitivity at very high specificity.

**Figure 12. The ROC curve for RDP on all levels for 100 bp reads.**
Performance decreases for all levels compared to the 500 bp reads, but the areas under the curve are still over 75%.

**Figure 13. Comparison of 500 bp read performance on databases composed of genera that have at least 10 example sequences (well-represented genera) and genera which have highly similar sequences (genera with one CD-HIT cluster).**
While the database with the highly similar genera has about the same performance as the original, the database with the well-represented genera performs about 10% better, in terms of f-measure.

**Figure 14. Comparison of 250 bp read performance on databases composed of genera that have at least 10 example sequences (well-represented genera) and genera which have highly similar sequences (genera with one CD-HIT cluster).**

**Figure 15. Comparison of 100 bp read performance on databases composed of genera that have at least 10 example sequences (well-represented genera) and genera which have highly similar sequences (genera with one CD-HIT cluster).**
The optimal detection threshold determined on the training dataset is shown with a blue star.

**Figure 16. Calculation of the correlation between 1) the detector prediction using the “present” database and 2) the full, “future” database.**
The percent change in the taxon bin is correlated to the previous prediction of novelty of the reads in that bin.

**Figure 17. Pearson correlation coefficient of the decrease predicted by the detector (from the half-database) to the change in abundance in that taxon with the full, updated database for an Amazon soil pyrosequence dataset.**
