Front Immunol. 2021 Jun 18:12:677870.

doi: 10.3389/fimmu.2021.677870. eCollection 2021.

Microbiome Preprocessing Machine Learning Pipeline

Yoel Jasner¹, Anna Belogolovski¹, Meirav Ben-Itzhak¹, Omry Koren², Yoram Louzoun¹

Affiliations

¹ Department of Mathematics, Bar-Ilan University, Ramat Gan, Israel.
² Azrieli Faculty of Medicine, Bar-Ilan University, Ramat Gan, Israel.

PMID: 34220823
PMCID: PMC8250139
DOI: 10.3389/fimmu.2021.677870

Yoel Jasner et al. Front Immunol. 2021.

Abstract

Background: 16S sequencing results are often used for Machine Learning (ML) tasks. 16S gene sequences are represented as feature counts, which are associated with taxonomic representation. Raw feature counts may not be the optimal representation for ML.

Methods: We checked multiple preprocessing steps and tested the optimal combination for 16S sequencing-based classification tasks. We computed the contribution of each step to the accuracy as measured by the Area Under Curve (AUC) of the classification.
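
As a minimal sketch of the accuracy measure named above, the AUC of a binary classifier can be computed as the fraction of (positive, negative) sample pairs that the scores rank correctly, with ties counted as one half. The function name `roc_auc` and the example labels and scores are illustrative assumptions, not taken from the MIPMLP code.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive sample
    receives a higher score than a random negative sample.
    Ties contribute 0.5. Labels are assumed to be 0 or 1."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and classifier scores, for illustration only.
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(example_auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

The pairwise form is quadratic in the number of samples, which is acceptable for a sketch; library implementations typically use a sort-based equivalent.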

Results: We show that the log of the feature counts is much more informative than the relative counts. We further show that merging features associated with the same taxonomy at a given level, through a dimension reduction step for each group of bacteria, improves the AUC. Finally, we show that z-scoring has a very limited effect on the results.
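
The scaling options compared in the Results can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline's own code: the pseudocount `epsilon=0.1`, the base-10 logarithm, the function names, and the toy count table are all choices made for this example.

```python
import numpy as np


def relative_scale(counts):
    """Divide each sample (row) by its total count."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Guard against empty samples to avoid division by zero.
    return counts / np.where(totals == 0, 1.0, totals)


def log_scale(counts, epsilon=0.1):
    """log10(count + epsilon). epsilon is an assumed pseudocount
    that keeps zero counts finite."""
    return np.log10(np.asarray(counts, dtype=float) + epsilon)


def z_score(x, axis=0):
    """Center and scale along `axis` (0: per feature, 1: per sample)."""
    x = np.asarray(x, dtype=float)
    mean = x.mean(axis=axis, keepdims=True)
    std = x.std(axis=axis, keepdims=True)
    # Constant rows/columns are left centered but unscaled.
    return (x - mean) / np.where(std == 0, 1.0, std)


# Toy samples-by-features count table (made-up numbers).
counts = np.array([[0, 10, 990],
                   [5, 95, 900],
                   [50, 50, 100]])
rel = relative_scale(counts)
logged = log_scale(counts)
z = z_score(logged, axis=0)
```

Relative scaling compresses rare features toward zero, whereas the log keeps differences between rare features on a comparable scale; that contrast is one plausible intuition for the reported result, not a claim made in the abstract.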

Conclusions: The preprocessing of microbiome 16S data is crucial for optimal microbiome-based Machine Learning. These preprocessing steps are integrated into the MIPMLP (Microbiome Preprocessing Machine Learning Pipeline), which is available as a stand-alone version at https://github.com/louzounlab/microbiome/tree/master/Preprocess or as a service at http://mip-mlp.math.biu.ac.il/Home. Both contain the code and standard test sets.

Keywords: 16S; ASV; OTU; feature selection; machine learning; pipeline.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

**Figure 1**
Pipeline process diagram. The input is an OTU/ASV table and the appropriate taxonomy. The features are merged to a given taxonomic level. We tested three possible merging methods: Sum, Average and a PCA on each sub-group of features. Following the merging, we performed either a log scaling or a relative scaling. Following scaling, we performed z-scoring on either bacteria or samples or both, and finally, we tested whether performing a dimension reduction on the resulting merged and normalized features improves the accuracy of predictions.
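
The three merging methods in the caption above can be sketched as below. This is a hedged illustration, not the MIPMLP implementation: the function `merge_by_taxon`, its parameters, the taxon labels, and the toy table are hypothetical, and the PCA is computed here with a plain SVD on the centered columns of each group.

```python
import numpy as np


def merge_by_taxon(table, taxa, method="sum", n_components=1):
    """Merge feature columns that share a taxonomy label.

    table: samples x features array; taxa: one label per feature column.
    Returns (merged array, list of merged column names).
    """
    table = np.asarray(table, dtype=float)
    taxa = np.asarray(taxa)
    blocks, names = [], []
    for taxon in sorted(set(taxa.tolist())):
        group = table[:, taxa == taxon]
        if method == "sum":
            merged = group.sum(axis=1, keepdims=True)
        elif method == "mean":
            merged = group.mean(axis=1, keepdims=True)
        elif method == "sub_pca":
            # PCA restricted to this group: project the centered
            # columns onto their leading principal component(s).
            centered = group - group.mean(axis=0, keepdims=True)
            u, s, _ = np.linalg.svd(centered, full_matrices=False)
            k = min(n_components, s.size)
            merged = u[:, :k] * s[:k]
        else:
            raise ValueError(f"unknown merging method: {method}")
        blocks.append(merged)
        names.extend(f"{taxon}_{i}" for i in range(merged.shape[1]))
    return np.hstack(blocks), names


# Toy table: 4 samples, 3 features; the first two share a genus label.
table = np.array([[1, 2, 10],
                  [3, 4, 20],
                  [5, 6, 30],
                  [7, 9, 40]])
taxa = ["g__A", "g__A", "g__B"]
merged_sum, names = merge_by_taxon(table, taxa, method="sum")
merged_mean, _ = merge_by_taxon(table, taxa, method="mean")
merged_pca, _ = merge_by_taxon(table, taxa, method="sub_pca")
```

The sign of each PCA score column is arbitrary, so downstream code should not depend on it. Whether scaling is applied before or after the per-group PCA is left open in this sketch.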

**Figure 2**
Upper plots: typical ROC curves and the effect of preprocessing. The right plot uses the sub-PCA merging method, while the left plot uses the average merging method. The right upper plot has a higher AUC than the left one. Middle plots: average AUC, defined as the AUC averaged over all combinations of the other choices (e.g. merging methods, normalization etc.) when one choice (e.g. one taxonomy level) is held fixed. Lower plot: AUC predicted by the linear regression vs. the real AUC.

**Figure 3**
Linear regression coefficient for SVM classifier. Coefficients are the contribution of a choice to the total AUC. Each group of coefficients is marked by a different color and normalized to 0. The next two figures have the same layout, but for different classifiers. The regression is over all parameter combinations, including the choice of taxonomy level (red), the grouping method (blue), the dimension reduction method (purple) and the normalization method (green). Since not all combinations of normalization and standardization methods are possible, we listed every tested combination explicitly.
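
The regression described in this caption can be illustrated with a small sketch: each preprocessing choice is one-hot encoded, AUC is regressed on the encoded choices, and the coefficients of each group are shifted to have mean zero. Reading "normalized to 0" as mean-centering is an assumption, as are the function name `choice_coefficients` and the option names. The AUC values below are generated from invented additive effects purely to exercise the code; they are not results from the paper.

```python
import itertools
import numpy as np


def choice_coefficients(configs, aucs):
    """Least-squares fit of AUC on one-hot encoded choices.

    configs: list of dicts mapping choice group -> selected option.
    Returns {group: {option: coefficient}}, centered per group.
    """
    groups = sorted(configs[0])
    columns = [(g, v) for g in groups
               for v in sorted({c[g] for c in configs})]
    # Intercept column followed by one indicator per (group, option).
    X = np.array([[1.0] + [float(c[g] == v) for g, v in columns]
                  for c in configs])
    beta, *_ = np.linalg.lstsq(X, np.asarray(aucs, dtype=float), rcond=None)
    raw = dict(zip(columns, beta[1:]))
    result = {}
    for g in groups:
        vals = {v: raw[(gg, v)] for gg, v in columns if gg == g}
        mean = sum(vals.values()) / len(vals)
        result[g] = {v: b - mean for v, b in vals.items()}
    return result


# Invented additive effects, used only to generate toy AUC values.
toy_effects = {
    "merge": {"sum": 0.00, "mean": 0.00, "sub_pca": 0.03},
    "scale": {"relative": 0.00, "log": 0.04},
}
configs = [dict(zip(toy_effects, combo))
           for combo in itertools.product(*(toy_effects[g] for g in toy_effects))]
aucs = [0.6 + sum(toy_effects[g][c[g]] for g in c) for c in configs]
coefs = choice_coefficients(configs, aucs)
```

The one-hot design with an intercept is rank deficient, so the raw coefficients are not unique; centering within each group removes that ambiguity when every combination of options is present, which is why the centered values are the ones compared.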

**Figure 4**
Linear regression coefficient for XGBoost classifier.

**Figure 5**
Linear regression coefficient for MLP classifier.

**Figure 6**
HFE and MIPMLP mean AUC with standard error bars. Shaded bars are the training set and full bars are the test set. The y axis is AUC. Different groups of bars are different datasets.
