Prediction of Smoking Habits From Class-Imbalanced Saliva Microbiome Data Using Data Augmentation and Machine Learning

Front Microbiol. 2022 Jul 19;13:886201. doi: 10.3389/fmicb.2022.886201. eCollection 2022.

Celia Díez López¹, Diego Montiel González¹, Athina Vidaki¹, Manfred Kayser¹

PMID: 35928158
PMCID: PMC9343866
DOI: 10.3389/fmicb.2022.886201

Celia Díez López et al. Front Microbiol. 2022.

Affiliation

¹ Department of Genetic Identification, Erasmus MC University Medical Center Rotterdam, Rotterdam, Netherlands.

Abstract

Human microbiome research is moving from characterization and association studies to translational applications in medical research, clinical diagnostics, and others. One of these applications is the prediction of human traits, where machine learning (ML) methods are often employed, but face practical challenges. Class imbalance in available microbiome data is one of the major problems, which, if unaccounted for, leads to spurious prediction accuracies and limits the classifier's generalization. Here, we investigated the predictability of smoking habits from class-imbalanced saliva microbiome data by combining data augmentation techniques to account for class imbalance with ML methods for prediction. We collected publicly available saliva 16S rRNA gene sequencing data and smoking habit metadata demonstrating a serious class imbalance problem, i.e., 175 current vs. 1,070 non-current smokers. Three data augmentation techniques (synthetic minority over-sampling technique, adaptive synthetic, and tree-based associative data augmentation) were applied together with seven ML methods: logistic regression, k-nearest neighbors, support vector machine with linear and radial kernels, decision trees, random forest, and extreme gradient boosting. K-fold nested cross-validation was used with the different augmented data types and baseline non-augmented data to validate the prediction outcome. Combining data augmentation with ML generally outperformed baseline methods in our dataset. The final prediction model combined tree-based associative data augmentation and support vector machine with linear kernel, and achieved a classification performance expressed as Matthews correlation coefficient of 0.36 and AUC of 0.81. Our method successfully addresses the problem of class imbalance in microbiome data for reliable prediction of smoking habits.
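The interpolation idea behind the synthetic minority over-sampling technique (SMOTE) named in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' pipeline: the function name `smote`, the toy feature values, and the parameters `k` and `seed` are assumptions made for the example, and the abstract does not state which implementation or target class ratio was used. For scale, fully balancing 175 current against 1,070 non-current smokers would require 895 synthetic minority samples.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic samples interpolated between minority samples.

    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample (excluding itself)
        neighbours = sorted(
            (s for s in minority if s is not base),
            key=lambda s: math.dist(base, s),
        )[:k]
        partner = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to partner
        synthetic.append([b + gap * (p - b) for b, p in zip(base, partner)])
    return synthetic


# Toy minority class: three samples with two features each (made-up values).
minority = [[0.10, 0.40], [0.20, 0.35], [0.15, 0.50]]
new_samples = smote(minority, n_new=4, k=2)
```

Each synthetic sample lies on the line segment between two real minority samples, so it stays inside the range already covered by the minority class. Adaptive synthetic sampling (ADASYN) uses the same interpolation but generates more samples around minority points that are harder to classify.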

Keywords: class imbalance; data augmentation; human microbiome; machine learning; prediction modeling; saliva microbiome; smoking status; trait prediction.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

**Figure 1**
Overview of the study's analytical strategy. **(A–C)** The original dataset was split into a training set (80%) (purple box in B) and a holdout test set (20%) (red box in C) by maintaining the original ratio between classes in the partitions. Data augmentation techniques were applied to the training set, making a total of six different input data types (d = 6), including baseline non-augmented and differently augmented data types. **(D)** For the nested cross-validation (nCV) approach, the training set was split into five outer k-folds of training (80%) (orange box in D) and test (20%) (blue box in D) sets each. **(E)** Each outer k-fold was split into two inner n-folds of training (50%) and validation (50%) sets (orange box in E) in which seven different machine learning (ML) models (m = 7) were optimized and validated (inner models). **(F)** The best-performing n-fold inner model (green box in F) was applied to the corresponding k-fold test set (green arrow to blue box in F). **(G)** For each k-fold test set, two performance metrics were obtained: Matthews correlation coefficient (MCC) and area under the receiver operating characteristic curve (AUC). Repetition of steps **(D)** to **(G)** for all the input data types (d = 6) with ML method (m = 7) (total of 42 different approaches). **(H)** Repetition of steps **(A)** to **(G)** 10 times (i = 10) to control for introduced variation by data partitions. **(I)** Selection of the best-performing data type with ML method based on MCC metric and training on full final 80% training set to create the final prediction model. **(J)** Validation of final prediction model on final 20% holdout test set.
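The partitioning in steps (A) to (E) can be sketched as follows. This is an illustrative outline under stated assumptions, not the study's code: the helper `stratified_folds`, the seeds, and the use of one of five stratified folds as the 20% holdout are choices made for the example. Only the class sizes (175 vs. 1,070) and the fold counts (five outer, two inner) come from the source.

```python
import random


def stratified_folds(labels, k, seed=0):
    """Split sample positions into k folds, keeping class ratios roughly equal."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled samples of this class round-robin into the folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


# Class sizes from the abstract: 175 current (1) vs. 1,070 non-current (0) smokers.
labels = [1] * 175 + [0] * 1070

# Steps A-C: one of five stratified folds serves as the 20% holdout test set.
parts = stratified_folds(labels, k=5, seed=0)
holdout = parts[0]
training = [i for part in parts[1:] for i in part]

# Steps D-E: five outer folds on the training set, two inner folds per outer
# training part. Indices below are positions within `training`, not original IDs.
train_labels = [labels[i] for i in training]
outer = stratified_folds(train_labels, k=5, seed=1)
for f, outer_test in enumerate(outer):
    outer_train = [i for g, fold in enumerate(outer) if g != f for i in fold]
    inner = stratified_folds([train_labels[i] for i in outer_train], k=2, seed=2)
    # inner[0] and inner[1] would alternate as inner training and validation
    # sets; the best inner model would then be scored on outer_test (steps F-G).
```

Keeping the holdout set outside every augmentation and tuning step is what allows the final evaluation in step (J) to estimate performance on unseen data.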

**Figure 2**
Validation of data types with machine learning (ML) methods for microbiome-based prediction of smoking habits based on the S1 and S2 datasets together. For each ML method, we evaluated six types of input data: baseline non-augmented and five augmented datasets based on different methods (ADASYN-1, ADASYN-2, SMOTE-1, SMOTE-2, and TADA). **(A)** Matthews correlation coefficient (MCC) and **(B)** area under the receiver operating characteristic curve (AUC) values from the 5-fold nested cross-validation repeated 10 times (5 × 10). For MCC, +1 represents a perfect prediction, 0 a random prediction, and −1 a perfect inverse prediction. For AUC, 1 indicates a perfectly accurate prediction and 0.5 a random prediction. ML method abbreviations: DT, decision trees; KNN, k-nearest neighbors; LR, logistic regression; RF, random forest; SVML, support vector machine with linear kernel; SVMR, support vector machine with radial kernel; XGBoost, extreme gradient boosting.
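The MCC scale described in the caption can be checked with a small worked example. The sketch below uses the standard MCC definition from confusion-matrix counts; the function name `mcc` and the choice to return 0 when the denominator is zero are assumptions of this example (a common convention), and the three hypothetical classifiers are illustrative rather than results from the study.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: an undefined MCC (zero denominator) is reported as 0.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical classifier that labels all 1,245 samples as non-current smokers.
tp, tn, fp, fn = 0, 1070, 0, 175
accuracy = (tp + tn) / (tp + tn + fp + fn)
majority_mcc = mcc(tp, tn, fp, fn)

# Hypothetical perfect and perfectly inverted classifiers on the same data.
perfect_mcc = mcc(tp=175, tn=1070, fp=0, fn=0)
inverse_mcc = mcc(tp=0, tn=0, fp=1070, fn=175)
```

With 175 current and 1,070 non-current smokers, always predicting the majority class reaches an accuracy of about 0.86 while its MCC is 0, which illustrates why accuracy alone can be misleading on class-imbalanced data.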

Cited by

Association of general health and lifestyle factors with the salivary microbiota - Lessons learned from the ADDITION-PRO cohort.
Poulsen CS, Nygaard N, Constancias F, Stankevic E, Kern T, Witte DR, Vistisen D, Grarup N, Pedersen OB, Belstrøm D, Hansen T. Front Cell Infect Microbiol. 2022 Nov 16;12:1055117. doi: 10.3389/fcimb.2022.1055117. eCollection 2022. PMID: 36467723. Free PMC article.
