Comparative Study
BMC Bioinformatics. 2015;16 Suppl 14(Suppl 14):S1.
doi: 10.1186/1471-2105-16-S14-S1. Epub 2015 Oct 2.

Prediction of microbial phenotypes based on comparative genomics

Roman Feldbauer et al. BMC Bioinformatics. 2015.

Abstract

The accessibility of almost complete genome sequences of uncultivable microbial species from metagenomes necessitates computational methods that predict microbial phenotypes solely from genomic data. Here we investigate how comparative genomics can be utilized for the prediction of microbial phenotypes. The PICA framework facilitates the application and comparison of different machine learning techniques for phenotypic trait prediction. We have improved and extended PICA's support vector machine plug-in and suggest its applicability to large-scale genome databases and incomplete genome sequences. We have demonstrated that the predictive power for phenotypic traits is stable and is not perturbed by the rapid growth of genome databases. A new software tool facilitates the in-depth analysis of phenotype models, which associate expected and unexpected protein functions with particular traits. Most of the traits can be reliably predicted in genomes that are only 60-70% complete. We have established a new phenotypic model that predicts intracellular microorganisms. We could thereby demonstrate that independently evolved phenotypic traits, characterized by genome reduction, can also be reliably predicted based on comparative genomics. Our results suggest that the extended PICA framework can be used to automatically annotate phenotypes in near-complete microbial genome sequences, as generated in large numbers in current metagenomics studies.
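
The Figure 5 caption on this page describes incompleteness as simulated by random removal of x percent of all COGs in a genome. The following is a minimal sketch of that idea only, assuming a genome is represented as a set of COG identifiers. The function name `simulate_incompleteness`, the identifier format, and the 1000-COG example genome are hypothetical and are not taken from PICA.

```python
import random


def simulate_incompleteness(cogs, percent_removed, rng):
    """Return a copy of a genome's COG set with percent_removed % of its COGs dropped.

    Hypothetical helper: the source only states that COGs are removed at random.
    """
    ordered = sorted(cogs)  # sort first so a seeded rng gives reproducible samples
    n_removed = round(len(ordered) * percent_removed / 100)
    return set(rng.sample(ordered, len(ordered) - n_removed))


rng = random.Random(0)
genome = {f"COG{i:04d}" for i in range(1, 1001)}  # hypothetical genome with 1000 COGs
partial = simulate_incompleteness(genome, 30, rng)  # a "70% complete" genome
print(len(partial))  # 700
```

In an evaluation like the one described in the Figure 5 caption, a model trained on complete genomes would then be tested on such reduced sets.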

Figures

Figure 1
Phenotype prediction quality of different machine learning techniques. Quality for ten exemplary traits measured as balanced accuracy in 10 replicate 5-fold cross-validations. Error bars indicate standard deviation.
Figure 2
Effect of different machine learning techniques for phenotype prediction on run time. Run time for cross-validations described in Fig. 1. This amounts to the combined time for training and testing 50 subsets of the complete data set (plus some overhead).
Figure 3
Phenotype prediction quality for different SVM kernels, measured as balanced accuracy in 10 replicate 5-fold cross-validations. Kernel abbreviations: lin = linear, poly = polynomial, rbf = radial basis function; sigmoid is not abbreviated. For each kernel, PICA standard parameters were used. Error bars indicate standard deviation.
Figure 4
Computation run time and memory consumption per fold of 5-fold cross-validation of a virtual phenotype for increasing problem sizes. Virtual species contain average bacteria-sized genomes. Problem dimensionality increases with the number of species to approximately 200,000 for 5000 species. Memory was measured as peak main memory necessary at the beginning of cross-validation (maxMemory) and average memory usage after the peak (avgMemory).
Figure 5
Phenotype prediction performance for incomplete genomes. Each point represents a cross-validation with training on complete genomes and testing on incomplete genomes. Incompleteness was simulated by random removal of x percent of all COGs in a genome. No values below 50% are observed as discussed in the main text. Error bars indicate standard deviation.
Figure 6
Taxonomy of predicted obligate intracellular bacteria in eggNOG 4.0. All species were considered whose genomes are flagged as complete based on 40 marker COGs. (+) marks indicate species also present in the obligate intracellular training set. None of the facultative intracellular or free-living species in the training set was predicted as obligate intracellular.
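
The captions above report prediction quality as balanced accuracy over 10 replicate 5-fold cross-validations, which gives the 50 train/test subsets mentioned for Figure 2. The sketch below illustrates both ideas for a binary trait using only the Python standard library. It is not PICA's implementation: the helper names `balanced_accuracy` and `replicate_kfold`, the shuffling scheme, and the toy labels are assumptions made for illustration.

```python
import random


def balanced_accuracy(y_true, y_pred):
    """Mean of the recall on the positive class and the recall on the negative class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # true positives
    tn = sum(1 for t, p in pairs if not t and not p)  # true negatives
    pos = sum(1 for t in y_true if t)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)


def replicate_kfold(n_samples, k, replicates, seed=0):
    """Yield (train_idx, test_idx) for `replicates` independently shuffled k-fold splits."""
    rng = random.Random(seed)
    for _ in range(replicates):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]  # k roughly equal-sized folds
        for i, test in enumerate(folds):
            train = [j for f, fold in enumerate(folds) if f != i for j in fold]
            yield train, test


# Toy labels: 4 species with the trait, 2 without.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
print(balanced_accuracy(y_true, y_pred))  # 0.625

splits = list(replicate_kfold(n_samples=20, k=5, replicates=10))
print(len(splits))  # 50
```

Balanced accuracy weights both classes equally, so a classifier that always predicts the majority class scores 0.5 regardless of how unevenly the trait is distributed among species.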
