BMC Bioinformatics. 2015 Jul 1;16:205.
doi: 10.1186/s12859-015-0647-4.

Comparing K-mer based methods for improved classification of 16S sequences


Hilde Vinje et al. BMC Bioinformatics. 2015.

Abstract

Background: Precise and stable taxonomic classification is highly relevant in modern microbiology. Parallel to the explosion in the amount of accessible sequence data, the focus of classification methods has shifted. Previously, alignment-based methods were the most applicable tools. Now, methods based on counting K-mers in sliding windows are the most interesting classification approach with respect to both speed and accuracy. Here, we present a systematic comparison of five different K-mer based classification methods for the 16S rRNA gene. The methods differ from each other in both data usage and modelling strategy. We based our study on the well-known and widely used naïve Bayes classifier from the RDP project, and four other methods were implemented and tested on two different data sets, on both full-length sequences and fragments of typical read length.
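As an illustration of what "counting K-mers by sliding windows" means, the following minimal Python sketch slides a window of width K along a sequence one base at a time and tallies each word. The function name `count_kmers` and the choice to skip words containing ambiguity codes are assumptions made for this example and are not taken from the paper.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count all overlapping K-mers seen by a window of width k moved one base at a time."""
    seq = sequence.upper()
    counts = Counter()
    for start in range(len(seq) - k + 1):
        word = seq[start:start + k]
        # Assumption of this sketch: words with ambiguity codes (e.g. N) are skipped.
        if set(word) <= set("ACGT"):
            counts[word] += 1
    return counts
```

A sequence of length n yields at most n - K + 1 words, so a sequence shorter than K yields an empty count table.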

Results: The differences in classification error between the methods appeared small, but they were stable and held for both data sets tested. The preprocessed nearest-neighbour (PLSNN) method performed best for full-length 16S rRNA sequences, significantly better than the naïve Bayes RDP method. On fragmented sequences, the naïve Bayes Multinomial method performed best, significantly better than all other methods. For both data sets explored, and on both full-length and fragmented sequences, all five methods reached an error plateau.
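To make the idea of a multinomial naïve Bayes classifier on K-mer counts concrete, here is a generic textbook-style sketch in Python. It is not the authors' implementation and not the RDP classifier; the uniform prior over genera, the pseudo-count of 1, and the function names `kmers`, `train` and `classify` are all assumptions of this example.

```python
import math
from collections import Counter, defaultdict


def kmers(seq, k):
    """Return the list of overlapping K-mers in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def train(labelled_seqs, k):
    """Pool K-mer counts per genus from an iterable of (genus, sequence) pairs."""
    counts = defaultdict(Counter)
    for genus, seq in labelled_seqs:
        counts[genus].update(kmers(seq, k))
    return counts


def classify(seq, counts, k, pseudo=1.0):
    """Return the genus with the highest multinomial log-likelihood (uniform prior assumed)."""
    vocab = 4 ** k  # number of possible K-mers over A, C, G, T
    best, best_score = None, -math.inf
    for genus, c in counts.items():
        total = sum(c.values())
        # Pseudo-counts keep unseen K-mers from giving zero probability.
        score = sum(
            math.log((c[w] + pseudo) / (total + pseudo * vocab))
            for w in kmers(seq, k)
        )
        if score > best_score:
            best, best_score = genus, score
    return best
```

With toy training data such as `[("GenusA", "AAAAAAAA"), ("GenusB", "CCCCCCCC")]` (hypothetical labels), a query made only of A's is assigned to `GenusA`.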

Conclusions: We conclude that no K-mer based method is universally best for classifying both full-length sequences and fragments (reads). All methods approach an error plateau, indicating that improved training data is needed to improve classification further. Classification errors occur most frequently for genera with few sequences present. A better, more universal and more robust training data set is crucial for improving the taxonomy and for testing new classification methods.


Figures

Fig. 1
The size distribution histogram of genera in Trainingset9 and SilvaSet. The panels show the size distribution of genera in Trainingset9 and SilvaSet. Genera with more than 50 sequences are not included here; there are 26 such genera in Trainingset9 and 40 in SilvaSet
Fig. 2
Classification error for full-length 16S sequences. The top panels display the classification error for full-length sequences using all methods on word lengths K2−K8. The bottom panels show the same results, zoomed in on the last three word lengths (K6−K8). The results are discrete values for every K-mer length; the connecting lines are merely to aid visual interpretation
Fig. 3
Classification error for each test set. For Trainingset9, the classification error of all five methods is displayed for each of the 10 test sets from the 10-fold cross-validation on full-length sequences. The SilvaSet gave similar results. The results are discrete values for every K-mer length; the connecting lines are merely to aid visual interpretation
Fig. 4
Classification error for fragments. The top panels display the classification error for sequence fragments using all methods on word lengths K2−K8. Sequences were split into 10 (partly overlapping) fragments of 200 bases, and all fragments were classified. The bottom panels show the same results, zoomed in on the last three word lengths (K6−K8). The results are discrete values for every K-mer length; the connecting lines are merely to aid visual interpretation
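One way to split a sequence into 10 partly overlapping fragments of 200 bases is sketched below. The caption does not say how the fragment start positions were chosen, so the evenly spaced starts used here are an assumption, as is the function name `split_into_fragments`.

```python
def split_into_fragments(sequence, n_fragments=10, fragment_length=200):
    """Cut a sequence into n_fragments windows of fragment_length bases.

    Assumption of this sketch: start positions are spread evenly from the
    first base to the last position where a full fragment still fits, and
    n_fragments is at least 2.
    """
    if len(sequence) <= fragment_length:
        # Too short to fragment; return the sequence unchanged.
        return [sequence]
    last_start = len(sequence) - fragment_length
    starts = [round(i * last_start / (n_fragments - 1)) for i in range(n_fragments)]
    return [sequence[s:s + fragment_length] for s in starts]
```

For a 16S sequence of about 1500 bases, consecutive starts are roughly 144 bases apart, so neighbouring 200-base fragments overlap by about 56 bases.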
Fig. 5
The distribution of errors over methods. The Venn diagram shows how errors are distributed among the different methods. The number in each sector corresponds to the number of misclassified sequences. The results are from Trainingset9 and full-length sequences
Fig. 6
Error distribution over class sizes. The horizontal axis is the class size (number of sequences in a genus) and the vertical axis is the error percentage averaged over all genera of the same size. The upper panel is the result from Trainingset9 and the lower panel is the result from the SilvaSet. Only class sizes up to 10 are shown; for larger classes the error percentages are small
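The quantity on the vertical axis can be computed as sketched below: first an error percentage per genus, then an average over all genera that share the same class size. This is a minimal illustration, assuming the two input lists cover every sequence in the data set (for example, all cross-validation test sets combined); the function name `error_by_class_size` is hypothetical.

```python
from collections import Counter, defaultdict


def error_by_class_size(true_genera, predicted_genera):
    """Map class size -> error percentage averaged over all genera of that size."""
    sizes = Counter(true_genera)  # number of sequences per genus
    errors = Counter(
        t for t, p in zip(true_genera, predicted_genera) if t != p
    )  # misclassified sequences per true genus
    per_size = defaultdict(list)
    for genus, size in sizes.items():
        per_size[size].append(100.0 * errors[genus] / size)
    return {size: sum(v) / len(v) for size, v in sorted(per_size.items())}
```

Averaging per genus first gives every genus of a given size equal weight, which matches the caption's "averaged over all genera of the same size".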
Fig. 7
Position-specific error. The average error for each method on each of the 10 fragments. The fragments correspond roughly to (partly overlapping) regions from the start to the end of each 16S sequence. These results are from Trainingset9, but the results from SilvaSet were similar. The results are discrete values; the connecting lines are merely to aid visual interpretation

