Bioinformatics. 2018 Jul 1;34(13):i32-i42.
doi: 10.1093/bioinformatics/bty296.

MicroPheno: predicting environments and host phenotypes from 16S rRNA gene sequencing using a k-mer based representation of shallow sub-samples

Ehsaneddin Asgari et al. Bioinformatics. 2018.

Abstract

Motivation: Microbial communities play important roles in the function and maintenance of various biosystems, ranging from the human body to the environment. A major challenge in microbiome research is the classification of microbial communities of different environments or host phenotypes. The most common and cost-effective approach for such studies to date is 16S rRNA gene sequencing. Recent falls in sequencing costs have increased the demand for simple, efficient and accurate methods for rapid detection or diagnosis with proved applications in medicine, agriculture and forensic science. We describe a reference- and alignment-free approach for predicting environments and host phenotypes from 16S rRNA gene sequencing based on k-mer representations that benefits from a bootstrapping framework for investigating the sufficiency of shallow sub-samples. Deep learning methods as well as classical approaches were explored for predicting environments and host phenotypes.
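As an illustration of the k-mer representation of a shallow sub-sample described above, the following minimal sketch draws a random subset of reads and turns it into a normalized k-mer frequency vector. The function name `kmer_distribution` and its parameters (`k`, `sample_size`, `seed`) are hypothetical and are not taken from the MicroPheno software; the sketch assumes reads are plain DNA strings and skips k-mers containing characters other than A, C, G and T.

```python
import random
from collections import Counter
from itertools import product


def kmer_distribution(reads, k=6, sample_size=None, seed=0):
    """Return a normalized k-mer frequency vector for a (sub-)sample of reads.

    Hypothetical helper for illustration only; not the MicroPheno API.
    """
    rng = random.Random(seed)
    # Shallow sub-sampling: keep only `sample_size` randomly chosen reads.
    if sample_size is not None and sample_size < len(reads):
        reads = rng.sample(reads, sample_size)

    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Ignore k-mers with ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1

    # Fixed vocabulary of all 4**k k-mers so every sample maps to the same axes.
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(vocab)
    return [counts[kmer] / total for kmer in vocab]
```

A vector produced this way can be passed to any standard classifier; no reference database or sequence alignment is involved in building it.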

Results: A k-mer distribution of shallow sub-samples outperformed Operational Taxonomic Unit (OTU) features in the tasks of body-site identification and Crohn's disease prediction. Aside from being more accurate, using k-mer features in shallow sub-samples (i) allows skipping the computationally costly sequence alignments required in OTU-picking and (ii) provides a proof of concept for the sufficiency of shallow and short-length 16S rRNA sequencing for phenotype prediction. In addition, k-mer features predicted representative 16S rRNA gene sequences of 18 ecological environments and 5 organismal environments with high macro-F1 scores of 0.88 and 0.87, respectively. For large datasets, deep learning outperformed classical methods such as Random Forest and Support Vector Machine.
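For readers unfamiliar with the macro-F1 metric quoted above, the sketch below shows the standard definition: the F1 score is computed per class and then averaged with equal weight, so rare classes count as much as frequent ones. The function name `macro_f1` and the toy labels are illustrative assumptions, not data or code from the study.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (standard macro-F1)."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy example with made-up labels (not results from the paper).
y_true = ["gut", "gut", "oral", "nasal"]
y_pred = ["gut", "oral", "oral", "nasal"]
score = macro_f1(y_true, y_pred)
```

In the toy example the per-class F1 scores are 2/3 (gut), 1 (nasal) and 2/3 (oral), so the macro average is 7/9.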

Availability and implementation: The software and datasets are available at https://llp.berkeley.edu/micropheno.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
The components and the data flow in the MicroPheno computational workflow
Fig. 2.
General architecture of the MLP neural networks that have been used in this study for multi-class classification of environment and host phenotypes
Fig. 3.
Steps we take to explore parameters for the representations, and how we choose the classifier for prediction of the phenotype of interest in this study
Fig. 4.
Measuring (i) self-inconsistency ($\bar{D}_S$) and (ii) unrepresentativeness ($\bar{D}_R$) for the body-site dataset. Each point represents an average of 100 resamples belonging to 10 randomly selected 16S rRNA samples. Higher k-values require higher sampling rates to produce self-consistent and representative samples
Fig. 5.
The confusion matrix for the classification of five major body-sites, using a Random Forest classifier in a 10-fold cross-validation scheme. The presented body-sites are saliva (o: oral), mid-vagina (u: urogenital), anterior nares (n: nasal), stool (g: gut), and posterior fornix (u: urogenital)
Fig. 6.
Visualization of the (a) body-site, (b) Crohn’s disease and (c) 18 ecological microbial environments datasets using different projection methods: (i) PCA over 6-mer distributions with unsupervised training, (ii) t-SNE over 6-mer distributions with unsupervised training, (iii) visualization of the activations of the last layer of the trained neural network (projected to 2D using t-SNE)
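The resampling measurement summarized in the Fig. 4 caption can be sketched roughly as follows. This page does not state which divergence the authors use, so the sketch assumes a smoothed Kullback-Leibler divergence; the function names, the smoothing constant `eps` and all parameter defaults are hypothetical. Self-inconsistency is taken here as the mean divergence between pairs of resamples, and unrepresentativeness as the mean divergence between each resample and the full sample.

```python
import math
import random
from collections import Counter
from itertools import product


def kmer_dist(reads, k, eps=1e-6):
    """Smoothed k-mer distribution over the ACGT vocabulary (illustrative)."""
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    total = max(sum(counts[v] for v in vocab), 1)
    # Smooth after normalizing so that no probability is exactly zero.
    return [(counts[v] / total + eps) / (1 + eps * len(vocab)) for v in vocab]


def kl(p, q):
    """Kullback-Leibler divergence; an assumed choice of divergence."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def bootstrap_divergences(reads, k, sample_size, n_resamples=10, seed=0):
    """Return (self_inconsistency, unrepresentativeness) for one sample.

    Requires n_resamples >= 2 so that at least one resample pair exists.
    """
    rng = random.Random(seed)
    full = kmer_dist(reads, k)
    # Bootstrap: draw reads with replacement at the chosen sampling depth.
    resamples = [
        kmer_dist(rng.choices(reads, k=sample_size), k)
        for _ in range(n_resamples)
    ]
    pairs = [(a, b) for i, a in enumerate(resamples) for b in resamples[i + 1:]]
    self_inconsistency = sum(kl(a, b) for a, b in pairs) / len(pairs)
    unrepresentativeness = sum(kl(r, full) for r in resamples) / len(resamples)
    return self_inconsistency, unrepresentativeness
```

Running this over a grid of `k` and `sample_size` values and averaging over several samples would give curves of the kind the caption describes, where both quantities shrink as the sampling depth grows.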

