Front Microbiol. 2022 Jul 19;13:886201.
doi: 10.3389/fmicb.2022.886201. eCollection 2022.

Prediction of Smoking Habits From Class-Imbalanced Saliva Microbiome Data Using Data Augmentation and Machine Learning

Celia Díez López et al. Front Microbiol. 2022.

Abstract

Human microbiome research is moving from characterization and association studies to translational applications in medical research, clinical diagnostics, and others. One of these applications is the prediction of human traits, where machine learning (ML) methods are often employed, but face practical challenges. Class imbalance in available microbiome data is one of the major problems, which, if unaccounted for, leads to spurious prediction accuracies and limits the classifier's generalization. Here, we investigated the predictability of smoking habits from class-imbalanced saliva microbiome data by combining data augmentation techniques to account for class imbalance with ML methods for prediction. We collected publicly available saliva 16S rRNA gene sequencing data and smoking habit metadata demonstrating a serious class imbalance problem, i.e., 175 current vs. 1,070 non-current smokers. Three data augmentation techniques (synthetic minority over-sampling technique, adaptive synthetic, and tree-based associative data augmentation) were applied together with seven ML methods: logistic regression, k-nearest neighbors, support vector machine with linear and radial kernels, decision trees, random forest, and extreme gradient boosting. K-fold nested cross-validation was used with the different augmented data types and baseline non-augmented data to validate the prediction outcome. Combining data augmentation with ML generally outperformed baseline methods in our dataset. The final prediction model combined tree-based associative data augmentation and support vector machine with linear kernel, and achieved a classification performance expressed as Matthews correlation coefficient of 0.36 and AUC of 0.81. Our method successfully addresses the problem of class imbalance in microbiome data for reliable prediction of smoking habits.

Keywords: class imbalance; data augmentation; human microbiome; machine learning; prediction modeling; saliva microbiome; smoking status; trait prediction.
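
The over-sampling idea behind the SMOTE technique named in the abstract can be illustrated with a minimal sketch. This is not the study's implementation: the function name `smote_like`, the toy feature vectors, and the choice of Euclidean distance on raw feature values are assumptions made only for illustration. Each synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its nearest minority-class neighbours.

```python
import math
import random


def smote_like(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style).

    Hypothetical helper for illustration; not the code used in the study.
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    if k < 1:
        raise ValueError("need at least two minority samples")
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [minority[j] for j in range(len(minority)) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class: four samples with two made-up features each.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote_like(minority, n_new=6, k=2)
print(len(synthetic))  # 6
```

Because every synthetic point lies between two real minority samples, the new data stay inside the region the minority class already occupies. Whether plain interpolation suits microbiome abundance data is a separate question that this sketch does not address.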

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

Figure 1
Overview of the study's analytical strategy. (A–C) The original dataset was split into a training set (80%) (purple box in B) and a holdout test set (20%) (red box in C), preserving the original class ratio in both partitions. Data augmentation techniques were applied to the training set, yielding six input data types (d = 6): the baseline non-augmented data and five differently augmented data types. (D) For nested cross-validation (nCV), the training set was split into five outer k-folds, each with a training (80%) (orange box in D) and a test (20%) (blue box in D) set. (E) Each outer k-fold was split into two inner n-folds of training (50%) and validation (50%) sets (orange box in E), in which seven machine learning (ML) models (m = 7) were optimized and validated (inner models). (F) The best-performing inner model (green box in F) was applied to the corresponding k-fold test set (green arrow to blue box in F). (G) For each k-fold test set, two performance metrics were obtained: Matthews correlation coefficient (MCC) and area under the receiver operating characteristic curve (AUC). Steps (D) to (G) were repeated for every combination of input data type (d = 6) and ML method (m = 7), for a total of 42 approaches. (H) Steps (A) to (G) were repeated 10 times (i = 10) to control for variation introduced by the data partitions. (I) The best-performing combination of data type and ML method was selected based on MCC and trained on the full 80% training set to create the final prediction model. (J) The final prediction model was validated on the 20% holdout test set.
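
The partitioning described in this caption can be sketched as index bookkeeping. The helper names `stratified_folds` and `nested_cv_plan` are hypothetical, the round-robin stratification is an assumed simplification, and model fitting, tuning, and augmentation are left out; only the class counts (175 current vs. 1,070 non-current smokers) come from the abstract.

```python
import random


def stratified_folds(indices, labels, n_folds, rng):
    """Deal the given sample indices into n_folds folds, class by class,
    so that each fold keeps roughly the original class ratio."""
    by_class = {}
    for idx in indices:
        by_class.setdefault(labels[idx], []).append(idx)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(by_class):
        members = by_class[cls]
        rng.shuffle(members)
        for pos, idx in enumerate(members):
            folds[pos % n_folds].append(idx)
    return folds


def nested_cv_plan(labels, n_outer=5, n_inner=2, seed=0):
    """Return (holdout, training, plan) following steps A to E of the caption."""
    rng = random.Random(seed)
    # Steps A-C: stratified 80/20 split (one of five stratified parts is held out).
    parts = stratified_folds(list(range(len(labels))), labels, 5, rng)
    holdout = parts[0]
    training = [i for part in parts[1:] for i in part]
    plan = []
    # Step D: outer folds on the training set.
    for outer_test in stratified_folds(training, labels, n_outer, rng):
        held = set(outer_test)
        outer_train = [i for i in training if i not in held]
        # Step E: inner folds for model tuning (50% / 50% when n_inner = 2).
        inner_folds = stratified_folds(outer_train, labels, n_inner, rng)
        plan.append({"outer_train": outer_train,
                     "outer_test": outer_test,
                     "inner_folds": inner_folds})
    return holdout, training, plan


# 1 = current smoker, 0 = non-current smoker (counts from the abstract).
labels = [1] * 175 + [0] * 1070
holdout, training, plan = nested_cv_plan(labels)
print(len(holdout), len(training), len(plan))  # 249 996 5
print(6 * 7)  # 42 combinations of data type and ML method
```

With these counts, the holdout set contains 35 current smokers and each outer test fold contains 28, so every partition keeps roughly the original 175:1,070 ratio.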
Figure 2
Validation of data types with machine learning (ML) methods for microbiome-based prediction of smoking habits based on the S1 and S2 datasets together. For each ML method, we evaluated six types of input data: baseline non-augmented and five augmented datasets based on different methods (ADASYN-1, ADASYN-2, SMOTE-1, SMOTE-2, and TADA). (A) Matthews correlation coefficient (MCC) and (B) area under the receiver operating characteristic curve (AUC) values from the 5-fold nested cross-validation, repeated 10 times (5 × 10). For MCC, +1 represents a perfect prediction, 0 a random prediction, and −1 a perfect inverse prediction. For AUC, 1 indicates a perfectly accurate prediction and 0.5 indicates a random prediction. ML method abbreviations: DT, decision trees; KNN, k-nearest neighbors; LR, logistic regression; RF, random forest; SVML, support vector machine with linear kernel; SVMR, support vector machine with radial kernel; XGBoost, extreme gradient boosting.
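
The interpretation of the two metrics can be checked with a small worked example. The functions `mcc` and `auc` below are illustrative helpers written for this sketch, and returning 0 when the MCC denominator is zero is an assumed convention. Only the class counts (175 current vs. 1,070 non-current smokers) come from the abstract; the scores passed to `auc` are made up.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts.
    Returns 0.0 when the denominator is zero (assumed convention)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def auc(scores_pos, scores_neg):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# A classifier that labels everyone "non-current smoker":
tp, fp, tn, fn = 0, 0, 1070, 175
accuracy = (tp + tn) / (tp + fp + tn + fn)
print(round(accuracy, 3))             # 0.859, despite finding no smokers
print(mcc(tp, fp, tn, fn))            # 0.0

print(mcc(175, 0, 1070, 0))           # 1.0, perfect prediction
print(mcc(0, 1070, 0, 175))           # -1.0, perfect inverse prediction
print(auc([0.9, 0.8], [0.1, 0.2]))    # 1.0, perfect ranking
print(auc([0.5, 0.5], [0.5, 0.5]))    # 0.5, uninformative scores
```

The first case shows why accuracy alone is misleading under this class imbalance: always predicting the majority class gives about 86% accuracy while MCC stays at 0.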
