Prediction of microbial phenotypes based on comparative genomics

Roman Feldbauer, Frederik Schulz, Matthias Horn, Thomas Rattei

PMID: 26451672
PMCID: PMC4603748
DOI: 10.1186/1471-2105-16-S14-S1

Comparative Study

Roman Feldbauer et al. BMC Bioinformatics. 2015.

BMC Bioinformatics. 2015;16 Suppl 14:S1.

doi: 10.1186/1471-2105-16-S14-S1. Epub 2015 Oct 2.

Abstract

The availability of nearly complete genome sequences of uncultivable microbial species from metagenomes calls for computational methods that predict microbial phenotypes from genomic data alone. Here we investigate how comparative genomics can be used to predict microbial phenotypes. The PICA framework facilitates the application and comparison of different machine learning techniques for phenotypic trait prediction. We have improved and extended PICA's support vector machine plug-in and suggest that it is applicable to large-scale genome databases and incomplete genome sequences. We have demonstrated that the predictive power for phenotypic traits is stable and is not perturbed by the rapid growth of genome databases. A new software tool facilitates the in-depth analysis of phenotype models, which associate expected and unexpected protein functions with particular traits. Most of the traits can be reliably predicted in genomes that are only 60-70% complete. We have established a new phenotypic model that predicts intracellular microorganisms. With it, we could demonstrate that independently evolved phenotypic traits characterized by genome reduction can also be reliably predicted based on comparative genomics. Our results suggest that the extended PICA framework can be used to automatically annotate phenotypes in near-complete microbial genome sequences, which current metagenomics studies generate in large numbers.
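
The abstract does not spell out how genomes are represented for the classifier. The Figure 5 caption suggests that each genome is described by the COGs it contains, so one plausible encoding is a binary presence/absence vector per genome. The sketch below illustrates that assumption only; the function name `build_feature_matrix` and the toy genomes are hypothetical and are not taken from PICA.

```python
def build_feature_matrix(genomes):
    """Encode genomes as binary presence/absence vectors over all observed COGs.

    `genomes` maps a genome name to the set of COG identifiers found in it.
    This encoding is an assumption made for illustration.
    """
    # The feature space is the sorted union of every COG seen in any genome.
    vocabulary = sorted(set().union(*genomes.values()))
    index = {cog: i for i, cog in enumerate(vocabulary)}

    matrix = {}
    for name, cogs in genomes.items():
        row = [0] * len(vocabulary)
        for cog in cogs:
            row[index[cog]] = 1  # 1 = COG present in this genome
        matrix[name] = row
    return vocabulary, matrix


# Toy input with made-up COG assignments.
genomes = {
    "genome_A": {"COG0001", "COG0002", "COG0005"},
    "genome_B": {"COG0002", "COG0003"},
}
vocabulary, matrix = build_feature_matrix(genomes)
```

With this representation the problem dimensionality equals the number of distinct COGs across all genomes, which is consistent with the dimensionality growing as more species are added (Fig. 4).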

Figures

**Figure 1**
**Phenotype prediction quality of different machine learning techniques**. Quality for ten exemplary traits measured as balanced accuracy in 10 replicate 5-fold cross-validations. Error bars indicate standard deviation.
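
As a reading aid for the metric in this caption, here is a minimal sketch of balanced accuracy, assuming the standard definition for a binary trait (the mean of sensitivity and specificity). The label vectors are invented for illustration and do not come from the study.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = trait present)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must occur in y_true")
    sensitivity = tp / (tp + fn)  # fraction of positives recovered
    specificity = tn / (tn + fp)  # fraction of negatives recovered
    return (sensitivity + specificity) / 2


# Invented labels: 4 genomes with the trait, 2 without.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
score = balanced_accuracy(y_true, y_pred)  # sensitivity 0.75, specificity 0.5

# A classifier that always predicts "trait present" scores 0.5,
# regardless of how unbalanced the classes are.
constant_score = balanced_accuracy([1, 1, 1, 0], [1, 1, 1, 1])
```

Unlike plain accuracy, this measure does not reward a model for simply predicting the majority class, which matters when a trait is rare among the sequenced genomes.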

**Figure 2**
**Effect of different machine learning techniques for phenotype prediction on run time**. Run time for cross-validations described in Fig. 1. This amounts to the combined time for training and testing 50 subsets of the complete data set (plus some overhead).
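
The figure of 50 subsets follows from 10 replicates of 5-fold cross-validation. The sketch below generates such splits in plain Python. It is an illustration under assumptions: the folds are not stratified by class, and the function name and the seed are hypothetical, not PICA's implementation.

```python
import random


def replicate_kfold_splits(n_samples, n_folds=5, n_replicates=10, seed=0):
    """Yield (replicate, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for rep in range(n_replicates):
        # Reshuffle the sample order for every replicate.
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # Deal the shuffled indices into n_folds nearly equal folds.
        folds = [indices[i::n_folds] for i in range(n_folds)]
        for f, test_idx in enumerate(folds):
            train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
            yield rep, f, train_idx, test_idx


# 23 is an arbitrary sample count chosen for the example.
splits = list(replicate_kfold_splits(n_samples=23))
n_splits = len(splits)  # 10 replicates x 5 folds
```

Each of the resulting train/test pairs requires one model to be trained and evaluated, so the reported run time covers that many fits per technique and trait.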

**Figure 3**
**Phenotype prediction quality for different SVM kernels measured as balanced accuracy in 10 replicate 5-fold cross-validation**. Kernel abbreviations: lin, linear; poly, polynomial; rbf, radial basis function; sigmoid, sigmoid. For each kernel, PICA standard parameters were used. Error bars indicate standard deviation.
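
For readers unfamiliar with the four kernels named in this caption, the sketch below writes out their textbook forms in plain Python. The parameter values (`gamma`, `coef0`, `degree`) are placeholders chosen for the example; PICA's standard parameters are not given in this record.

```python
import math


def dot(x, y):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    return dot(x, y)


def polynomial_kernel(x, y, gamma=1.0, coef0=0.0, degree=3):
    return (gamma * dot(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(x, y, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(x, y) + coef0)


# Two invented binary feature vectors sharing two features.
x = [1, 0, 1, 1]
y = [1, 1, 0, 1]
k_lin = linear_kernel(x, y)       # number of shared features
k_poly = polynomial_kernel(x, y)  # cube of the shared-feature count
k_rbf = rbf_kernel(x, y)          # decays with the number of differing features
k_sig = sigmoid_kernel(x, y)
```

On binary vectors the linear kernel simply counts shared features, while the squared distance in the RBF kernel counts features present in only one of the two vectors.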

**Figure 4**
**Computation run time and memory consumption per fold of 5-fold cross-validation of a virtual phenotype for increasing problem sizes**. Virtual species have genomes of average bacterial size. Problem dimensionality increases with the number of species, reaching approximately 200,000 for 5000 species. Memory was measured as the peak main memory required at the beginning of cross-validation (maxMemory) and as the average memory usage after the peak (avgMemory).

**Figure 5**
**Phenotype prediction performance for incomplete genomes**. Each point represents a cross-validation with training on complete genomes and testing on incomplete genomes. Incompleteness was simulated by random removal of x percent of all COGs in a genome. No values below 50% are observed as discussed in the main text. Error bars indicate standard deviation.
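
The incompleteness simulation described in this caption can be sketched as follows. This is a minimal illustration, assuming that a genome is held as a set of distinct COG identifiers and that "x percent" is rounded to a whole number of COGs; the function name and the toy genome are hypothetical.

```python
import random


def simulate_incomplete_genome(cogs, percent_removed, rng=None):
    """Return a copy of `cogs` with a random `percent_removed` percent dropped."""
    if not 0 <= percent_removed <= 100:
        raise ValueError("percent_removed must be between 0 and 100")
    if rng is None:
        rng = random.Random()
    cog_list = sorted(cogs)  # fixed order so a seeded rng is reproducible
    n_remove = round(len(cog_list) * percent_removed / 100)
    removed = set(rng.sample(cog_list, n_remove))
    return set(cog_list) - removed


# Toy genome with ten made-up COG identifiers.
full_genome = {f"COG{i:04d}" for i in range(1, 11)}
partial_genome = simulate_incomplete_genome(full_genome, 30, random.Random(42))
```

In the evaluation described by the caption, a model would be trained on the complete genomes and then asked to classify the reduced sets produced this way.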

**Figure 6**
**Taxonomy of predicted obligate intracellular bacteria in eggNOG 4.0**. All species were considered whose genomes are flagged as complete based on 40 marker COGs. (+) marks indicate species also present in the obligate intracellular training set. None of the facultative intracellular or free-living species in the training set was predicted as obligate intracellular.
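
The caption does not state how the 40 marker COGs are turned into a completeness flag. One simple possibility, shown here purely as an assumption, is to compute the fraction of markers found in a genome and compare it with a threshold. The marker names below are placeholders, not the real marker set, and the threshold `min_fraction` is hypothetical.

```python
def marker_completeness(genome_cogs, marker_cogs):
    """Fraction of marker COGs that are present in the genome."""
    markers = set(marker_cogs)
    return len(markers & set(genome_cogs)) / len(markers)


def is_flagged_complete(genome_cogs, marker_cogs, min_fraction=1.0):
    """Flag a genome as complete if enough markers are present.

    `min_fraction` is an assumed cut-off; the actual criterion is not
    given in this record.
    """
    return marker_completeness(genome_cogs, marker_cogs) >= min_fraction


# Placeholder identifiers standing in for the 40 marker COGs.
markers = [f"MARKER_{i:02d}" for i in range(1, 41)]
genome = set(markers[:38]) | {"OTHER_1"}  # 38 of 40 markers plus one non-marker

completeness = marker_completeness(genome, markers)
```

Under this sketch the example genome would pass a 90% cut-off but fail a strict requirement that all markers be present.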
