BMC Bioinformatics. 2010 Jan 30;11:69.
doi: 10.1186/1471-2105-11-69.

From learning taxonomies to phylogenetic learning: integration of 16S rRNA gene data into FAME-based bacterial classification

Bram Slabbinck et al. BMC Bioinformatics. 2010.

Abstract

Background: Machine learning techniques have been shown to improve bacterial species classification based on fatty acid methyl ester (FAME) data. Nonetheless, FAME analysis has limited resolution for discriminating bacteria at the species level. In this paper, we approach the species classification problem from a taxonomic point of view. Such a taxonomy or tree is typically obtained by applying clustering algorithms to FAME data or to 16S rRNA gene data. The knowledge gained from the tree can then be used to evaluate FAME-based classifiers, resulting in a novel framework for bacterial species classification.

Results: With a view to learning in a taxonomic framework, we consider two types of trees. First, a FAME tree is constructed with a supervised divisive clustering algorithm. Second, phylogenetic trees are inferred from 16S rRNA gene sequence analysis by the NJ and UPGMA methods. In this second approach, the species classification problem is based on the combination of two different types of data: 16S rRNA gene sequence data are used for phylogenetic tree inference, and the corresponding binary tree splits are learned from FAME data. We call this learning approach 'phylogenetic learning'. Random Forest models are trained on the classification tasks in a stratified cross-validation setting. In this way, better classification results are obtained for species that are typically hard to distinguish with a single (flat) multi-class classification model.

Conclusions: FAME-based bacterial species classification is successfully evaluated in a taxonomic framework. Although the proposed approach does not improve overall accuracy compared to flat multi-class classification, it has some distinct advantages. First, it is better at distinguishing species on which flat multi-class classification fails. Second, the hierarchical classification structure makes it easy to evaluate and visualize the resolution of FAME data for discriminating bacterial species. In summary, phylogenetic learning lets us situate and evaluate FAME-based bacterial species classification in a more informative context.
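
The idea of learning one binary split per internal node of a fixed tree, as described in the Results, can be illustrated with a minimal sketch. This is not the authors' implementation: to stay dependency-free it swaps the Random Forest at each node for a nearest-centroid rule, and the tree, the species labels (`sp_A`, `sp_B`, `sp_C`) and the two-feature profiles are hypothetical toy values, not FAME data from the paper.

```python
from math import dist

# A binary tree is either a species name (leaf) or a (left, right) tuple.
def leaves(tree):
    """Return the set of species names under a (sub)tree."""
    if isinstance(tree, str):
        return {tree}
    return leaves(tree[0]) | leaves(tree[1])

def centroid(rows):
    """Component-wise mean of a non-empty list of equal-length profiles."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def train(tree, profiles, labels):
    """Fit one binary split model per internal node, top-down.

    Stand-in assumption: each side of a split is summarised by the centroid
    of the training profiles of all species below it (the paper trains
    Random Forests on these splits instead).
    """
    if isinstance(tree, str):
        return tree
    sides = []
    for child in tree:
        names = leaves(child)
        rows = [p for p, y in zip(profiles, labels) if y in names]
        sides.append((centroid(rows), train(child, profiles, labels)))
    return tuple(sides)

def predict(model, profile):
    """Descend from the root, taking the closer side at every split."""
    while not isinstance(model, str):
        (c_left, left), (c_right, right) = model
        closer_left = dist(profile, c_left) <= dist(profile, c_right)
        model = left if closer_left else right
    return model

# Hypothetical toy tree and data, for illustration only.
toy_tree = (("sp_A", "sp_B"), "sp_C")
toy_profiles = [[0.0, 0.0], [0.2, 0.0], [1.0, 0.0], [1.2, 0.0], [5.0, 5.0], [5.2, 5.0]]
toy_labels = ["sp_A", "sp_A", "sp_B", "sp_B", "sp_C", "sp_C"]
toy_model = train(toy_tree, toy_profiles, toy_labels)
print(predict(toy_model, [1.0, 0.1]))
```

In this arrangement the tree topology comes from one data source (in the paper, 16S rRNA gene sequences) while every split is learned from another (FAME profiles); the sketch keeps that separation by taking the tree as a fixed input to `train`.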

Figures

Figure 1
Divisive clustering tree. Phylogenetic tree resulting from the divisive clustering of the FAME data of 15 Bacillus species based on classification by Random Forests. Clustering is based on AUC and average linkage of the probability estimates calculated from identification by Random Forests. At the different nodes the corresponding AUC value is reported. The Bacillus cereus and Bacillus subtilis groups are coloured in blue and green, respectively.
Figure 2
Sensitivity and F-score values by phylogenetic learning based on a 16S rRNA gene NJ tree. For each Bacillus species, the corresponding sensitivity and F-score value of phylogenetic learning based on a 16S rRNA gene NJ tree is displayed. Sensitivity is indicated by the light blue bars, F-score by the green bars. The tree is visualized using the iTol webtool [40]. The Bacillus cereus and Bacillus subtilis groups are coloured in blue and green, respectively.
Figure 3
Sensitivity and F-score values by phylogenetic learning based on a 16S rRNA gene UPGMA tree. For each Bacillus species, the corresponding sensitivity and F-score value of phylogenetic learning based on a 16S rRNA gene UPGMA tree is displayed. Sensitivity is indicated by the light blue bars, F-score by the green bars. The tree is visualized using the iTol webtool [40]. The Bacillus cereus and Bacillus subtilis groups are coloured in blue and green, respectively.
Figure 4
Sensitivity and F-score values for flat multi-class classification. For each Bacillus species, the corresponding sensitivity and F-score value of flat multi-class classification is displayed along the 16S rRNA gene NJ tree. Sensitivity is indicated by the light blue bars, F-score by the green bars. The tree is visualized using the iTol webtool [40]. The Bacillus cereus and Bacillus subtilis groups are coloured in blue and green, respectively.
Figure 5
Comparison of performance at class level. For each class, sensitivity and F-score values resulting from phylogenetic learning based on a 16S rRNA gene NJ or UPGMA tree are compared to those obtained by flat multi-class classification. Four plots are given. The X-axis corresponds to thresholds set on the corresponding metric values. Threshold steps of 0.01 are chosen. For each threshold, flat multi-class classification is evaluated at class level and those classes with metric values smaller than or equal to the threshold are selected. Classification performance by phylogenetic learning is analyzed at class level for each set of classes. The Y-axis on the left projects the number of phylogenetic learning classes that have a higher metric value than those obtained by flat multi-class classification. The red line expresses this number, relative to the size of the corresponding set (Y-axis on the right).
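
The threshold sweep described in the Figure 5 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' code: the function name `compare_at_thresholds`, the class names and the metric values are hypothetical, and "higher" is taken to mean strictly greater.

```python
def compare_at_thresholds(flat, hier, n_steps=100):
    """Sweep thresholds 0, 1/n_steps, ..., 1 over a per-class metric.

    flat, hier: dicts mapping class name -> metric value in [0, 1] for flat
    multi-class classification and for phylogenetic learning, respectively.
    For each threshold, classes whose flat value is <= threshold are selected,
    and the classes among them with a strictly higher hier value are counted.
    Returns a list of (threshold, n_selected, n_improved, fraction_improved);
    the fraction is None while no class is selected.
    """
    results = []
    for i in range(n_steps + 1):
        threshold = i / n_steps
        selected = [c for c, v in flat.items() if v <= threshold]
        improved = sum(1 for c in selected if hier[c] > flat[c])
        fraction = improved / len(selected) if selected else None
        results.append((threshold, len(selected), improved, fraction))
    return results

# Hypothetical per-class sensitivities, for illustration only.
flat_scores = {"class_a": 0.42, "class_b": 0.755, "class_c": 0.98}
hier_scores = {"class_a": 0.60, "class_b": 0.70, "class_c": 0.98}
sweep = compare_at_thresholds(flat_scores, hier_scores)
print(sweep[50])
```

With the default `n_steps=100` the thresholds advance in steps of 0.01, matching the caption. The third element of each tuple corresponds to the count on the left Y-axis and the fourth to the relative value on the right Y-axis.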
Figure 6
Average misclassification depth of phylogenetic learning based on a 16S rRNA gene NJ tree. The average depth of the misclassified test profiles of each species is visualized for phylogenetic learning based on a 16S rRNA gene NJ tree. Depth equals the number of nodes along the classification path until misclassification occurs (the corresponding node also included) and corresponds to the green bars. The maximum or correct depth is shown by the red bars. Maximum depth equals the number of nodes along the true phylogenetic path (leaf included).
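
The depth measure in the Figure 6 caption can be made concrete with a short sketch. It rests on one interpretation that the caption does not spell out: the depth of a misclassification is taken to be the number of nodes shared by the true and the predicted root-to-leaf paths, so that the node making the wrong decision is the last one counted. The tree, the leaf labels and the function names are hypothetical.

```python
def path_to(tree, species, node_id="root"):
    """Node ids from the root down to the leaf for `species`, or None.

    A tree is either a species name (leaf) or a (left, right) tuple.
    Internal nodes are named by their route from the root, e.g. "rootRL".
    """
    if isinstance(tree, str):
        return [tree] if tree == species else None
    for side, child in zip("LR", tree):
        sub_path = path_to(child, species, node_id + side)
        if sub_path is not None:
            return [node_id] + sub_path
    return None

def misclassification_depth(tree, true_species, predicted_species):
    """Return (depth, max_depth) for one test profile.

    depth: nodes shared by the true and predicted paths (assumed reading).
    max_depth: nodes on the true path, leaf included.
    """
    true_path = path_to(tree, true_species)
    predicted_path = path_to(tree, predicted_species)
    shared = 0
    for a, b in zip(true_path, predicted_path):
        if a != b:
            break
        shared += 1
    return shared, len(true_path)

def average_misclassification_depth(tree, true_species, predictions):
    """Mean depth over the misclassified predictions, or None if there are none."""
    wrong = [p for p in predictions if p != true_species]
    if not wrong:
        return None
    total = sum(misclassification_depth(tree, true_species, p)[0] for p in wrong)
    return total / len(wrong)

# Hypothetical five-leaf tree, for illustration only.
toy_tree = (("A", "B"), ("C", ("D", "E")))
print(misclassification_depth(toy_tree, "D", "E"))
```

Under this reading a correctly classified profile reaches the maximum depth, and a small average depth indicates that errors for that species tend to occur near the root, between distantly related groups.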
