PeerJ. 2022 Apr 25:10:e13205.
doi: 10.7717/peerj.13205. eCollection 2022.

Inflammatory bowel disease biomarkers of human gut microbiota selected via different feature selection methods

Burcu Bakir-Gungor et al. PeerJ.

Abstract

The tremendous boost in next generation sequencing and in the "omics" technologies makes it possible to characterize the human gut microbiome, the collective genomes of the microbial community that resides in our gastrointestinal tract. Although some of these microorganisms are considered to be essential regulators of our immune system, the alteration of the complexity and eubiotic state of the microbiota might promote autoimmune and inflammatory disorders such as diabetes, rheumatoid arthritis, inflammatory bowel diseases (IBD), obesity, and carcinogenesis. IBD, comprising Crohn's disease and ulcerative colitis, is a gut-related, multifactorial disease with an unknown etiology. IBD presents defects in the detection and control of the gut microbiota, associated with unbalanced immune reactions, genetic mutations that confer susceptibility to the disease, and complex environmental conditions such as a westernized lifestyle. Although some existing studies attempt to unveil the composition and functional capacity of the gut microbiome in relation to IBD, a comprehensive picture of the gut microbiome in IBD patients is far from complete. Due to the complexity of metagenomic studies, state-of-the-art machine learning techniques have become popular for addressing a wide range of questions in the field of metagenomic data analysis. In this regard, using an IBD-associated metagenomics dataset, this study utilizes both supervised and unsupervised machine learning algorithms (i) to generate a classification model that aids IBD diagnosis, (ii) to discover IBD-associated biomarkers, and (iii) to discover subgroups of IBD patients using k-means and hierarchical clustering approaches.
To deal with the high dimensionality of features, we applied robust feature selection algorithms such as Conditional Mutual Information Maximization (CMIM), Fast Correlation Based Filter (FCBF), min redundancy max relevance (mRMR), Select K Best (SKB), Information Gain (IG) and Extreme Gradient Boosting (XGBoost). In our experiments with 100-fold Monte Carlo cross-validation (MCCV), the XGBoost, IG, and SKB methods considerably reduced the number of microbiota features used for the diagnosis of IBD, thus reducing cost and time. We observed that, compared to Decision Tree, Support Vector Machine, Logitboost, Adaboost, and stacking ensemble classifiers, our Random Forest classifier achieved better performance measures for the classification of IBD. Our findings revealed potential microbiome-mediated mechanisms of IBD, and these findings might be useful for the development of microbiome-based diagnostics.
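
The Monte Carlo cross-validation mentioned above can be illustrated with a minimal sketch: the data are repeatedly split at random into training and test sets, a model is fitted on each training set, and the test scores are averaged. Everything below is an assumption made for illustration, not the authors' pipeline: the function names, the 10% test fraction, the synthetic two-feature data, and the nearest-centroid classifier, which stands in for Random Forest only to keep the example dependency-free.

```python
import random


def monte_carlo_splits(n_samples, n_iterations=100, test_fraction=0.1, seed=0):
    """Yield (train_indices, test_indices) for repeated random subsampling.

    test_fraction is a hypothetical choice; the abstract does not state it.
    """
    rng = random.Random(seed)
    n_test = max(1, round(n_samples * test_fraction))
    indices = list(range(n_samples))
    for _ in range(n_iterations):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        yield shuffled[n_test:], shuffled[:n_test]


def centroid(rows):
    """Column-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def nearest_centroid_predict(train_X, train_y, test_X):
    """Stand-in classifier (not Random Forest): assign the closest class centroid."""
    centroids = {
        label: centroid([x for x, y in zip(train_X, train_y) if y == label])
        for label in set(train_y)
    }

    def squared_distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return [
        min(centroids, key=lambda c: squared_distance(x, centroids[c]))
        for x in test_X
    ]


def mccv_accuracy(X, y, n_iterations=100, test_fraction=0.1, seed=0):
    """Mean test accuracy over all Monte Carlo splits."""
    scores = []
    for train_idx, test_idx in monte_carlo_splits(
        len(X), n_iterations, test_fraction, seed
    ):
        predictions = nearest_centroid_predict(
            [X[i] for i in train_idx], [y[i] for i in train_idx],
            [X[i] for i in test_idx],
        )
        correct = sum(p == y[i] for p, i in zip(predictions, test_idx))
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)


# Synthetic stand-in data: 20 "healthy" and 20 "IBD" samples, two features each.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(20)]
X += [[data_rng.gauss(5, 1), data_rng.gauss(5, 1)] for _ in range(20)]
y = ["healthy"] * 20 + ["IBD"] * 20

print(f"Mean accuracy over 100 random splits: {mccv_accuracy(X, y):.3f}")
```

Unlike k-fold cross-validation, the random splits are drawn independently, so a sample may appear in several test sets or in none; the number of iterations is therefore not tied to the dataset size.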

Keywords: Biomarker discovery; Classification; Feature selection; Human gut microbiome; Metagenomics.

Conflict of interest statement

Burcu Bakir-Gungor is an Academic Editor for PeerJ.

Figures

Figure 1. Illustration of the inflammatory bowel disease-associated metagenomics dataset.
Figure 2. Schematic representation of the methodology. (i) Feature selection methods (shown in red) are applied to detect the most important species for the development of IBD (IBD-associated microorganisms), (ii) Using the selected features, models are constructed and used for classification (shown in blue), (iii) K-means clustering algorithm is applied on data to discover subgroups of IBD patients and control samples (shown in green).
Figure 3. Numbers of selected species using different feature selection algorithms and the numbers of intersecting species among different feature selection methods.
Figure 4. Performance evaluations of different classifiers on IBD metagenomics dataset, utilizing 100-fold Monte Carlo cross-validation and using (A) XGBoost, (B) Select K Best, and (C) Information Gain feature selection methods, (D) 14 selected features, (E) all features.
Figure 5. Comparative evaluation of different feature selection methods based on (A) Accuracy, (B) Area under ROC, and (C) F-Measure, using the Exploration Cohort dataset.
Figure 6. Two-dimensional t-SNE maps for (A) healthy sample subgroups, and (B) IBD patient subgroups, which are identified using K-means clustering.
Figure 7. Relative abundance values of the identified species in healthy and IBD subgroups.
Figure 8. Zoomed-in view of the relative abundance values for: (A) Bifidobacterium bifidum, (B) Porphyromonas asaccharolytica, (C) Eubacterium hallii, (D) Dorea formicigenerans, (E) Lachnospiraceae bacterium 1_1_57FAA, (F) Peptostreptococcus anaerobius in healthy subgroups and the IBD subgroups.
Figure 9. Principal component analysis of (A, C) all IBD-associated metagenomics data, (B, D) reduced dataset that includes features for the 14 selected species, shown in 3D in (A, B) and in 2D in (C, D). Interactive 3D plots are provided as a supplementary material.
Figure 10. Hierarchical clustering of the samples, based on the relative amounts of the 14 selected species. The side bar on the left hand side indicates class labels: IBD patients and healthy samples are shown in red and blue, respectively. In the heatmap, the colors represent raw z-scores. While the black color indicates relative abundance values just around the mean, the lighter colors denote the relative abundance values of 1 to 4 standard deviations above the mean. The areas that are restricted with red boxes suggest differential relative abundance values for the corresponding species in the associated subgroup.
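
The z-scores described in the Figure 10 caption express each relative abundance as a number of standard deviations from the mean. A minimal sketch of that calculation follows, under assumptions that are not stated in the caption: the function name `z_scores` is hypothetical, the sample standard deviation (n - 1 denominator) is used, and the example abundances are made up.

```python
from statistics import mean, stdev


def z_scores(abundances):
    """Standardize values as (x - mean) / sd.

    Uses the sample standard deviation; whether the figure scales per species
    or per sample, and which sd estimator it uses, is an assumption here.
    """
    mu = mean(abundances)
    sd = stdev(abundances)
    if sd == 0:
        # All values identical: every value sits exactly at the mean.
        return [0.0 for _ in abundances]
    return [(a - mu) / sd for a in abundances]


# Made-up relative abundances of one species across five samples.
example = [0.5, 0.5, 1.0, 1.5, 4.0]
for value, z in zip(example, z_scores(example)):
    print(f"abundance={value:.1f}  z={z:+.2f}")
```

Values near zero correspond to the dark cells described in the caption, and large positive values correspond to the lighter cells.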
