Front Immunol. 2021 Jun 18:12:677870.
doi: 10.3389/fimmu.2021.677870. eCollection 2021.

Microbiome Preprocessing Machine Learning Pipeline

Yoel Jasner et al. Front Immunol.

Abstract

Background: 16S sequencing results are often used for Machine Learning (ML) tasks. 16S gene sequences are represented as feature counts, which are associated with taxonomic representation. Raw feature counts may not be the optimal representation for ML.

Methods: We checked multiple preprocessing steps and tested the optimal combination for 16S sequencing-based classification tasks. We computed the contribution of each step to the accuracy as measured by the Area Under Curve (AUC) of the classification.
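The AUC used as the accuracy measure here can be illustrated with a minimal sketch. It uses the standard rank-based definition (the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half). The function name `roc_auc` and the toy labels and scores are illustrative assumptions and are not taken from the MIPMLP code.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 0/1 class labels (1 = positive).
    scores: iterable of classifier scores, higher = more likely positive.
    Hypothetical helper for illustration only.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy example: 3 of the 4 positive/negative pairs are ranked correctly.
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a sketch; a sort-based computation would be preferable on large datasets.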

Results: We show that the log of the feature counts is much more informative than the relative counts. We further show that merging features that share the same taxonomy at a given level, using a dimension reduction step within each group of bacteria, improves the AUC. Finally, we show that z-scoring has a very limited effect on the results.
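The preprocessing steps compared above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the MIPMLP implementation: the function names, the dictionary-based table layout, the semicolon-separated taxonomy strings, and the pseudocount `epsilon=0.1` are all hypothetical, and only sum and average merging are shown (the per-group PCA merging is omitted).

```python
import math
from collections import defaultdict
from statistics import mean, pstdev


def merge_by_taxonomy(counts, taxonomy, level, how="sum"):
    """Merge features whose taxonomy agrees down to `level` ranks.

    counts:   {feature_id: [count per sample]}
    taxonomy: {feature_id: "k__...;p__...;c__..."} (assumed format)
    how:      "sum" or "mean"
    """
    groups = defaultdict(list)
    for feature, row in counts.items():
        key = ";".join(taxonomy[feature].split(";")[:level])
        groups[key].append(row)
    agg = sum if how == "sum" else mean
    # zip(*rows) iterates over samples, aggregating across grouped features.
    return {key: [agg(col) for col in zip(*rows)] for key, rows in groups.items()}


def log_scale(table, epsilon=0.1):
    """log10(count + epsilon); epsilon is an assumed pseudocount for zeros."""
    return {k: [math.log10(x + epsilon) for x in row] for k, row in table.items()}


def relative_scale(table):
    """Divide each count by the total count of its sample."""
    totals = [sum(col) for col in zip(*table.values())]
    return {k: [x / t if t else 0.0 for x, t in zip(row, totals)]
            for k, row in table.items()}


def z_score_rows(table):
    """Z-score each taxon across samples (population standard deviation)."""
    out = {}
    for k, row in table.items():
        mu, sd = mean(row), pstdev(row)
        out[k] = [(x - mu) / sd if sd else 0.0 for x in row]
    return out


# Toy table: three features, three samples (made-up numbers).
counts = {"asv1": [10, 0, 5], "asv2": [30, 20, 5], "asv3": [0, 80, 90]}
taxonomy = {
    "asv1": "k__Bacteria;p__Firmicutes;c__Clostridia",
    "asv2": "k__Bacteria;p__Firmicutes;c__Bacilli",
    "asv3": "k__Bacteria;p__Bacteroidetes;c__Bacteroidia",
}
merged = merge_by_taxonomy(counts, taxonomy, level=2)   # merge at phylum level
logged = log_scale(merged)
relative = relative_scale(merged)
standardized = z_score_rows(logged)
```

In this toy table the two Firmicutes features collapse into a single phylum-level row before scaling, so the log and relative representations are computed on merged counts, in the order described for the pipeline.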

Conclusions: The preprocessing of microbiome 16S data is crucial for optimal microbiome-based Machine Learning. These preprocessing steps are integrated into the MIPMLP (Microbiome Preprocessing Machine Learning Pipeline), which is available as a stand-alone version at https://github.com/louzounlab/microbiome/tree/master/Preprocess or as a service at http://mip-mlp.math.biu.ac.il/Home. Both contain the code and standard test sets.

Keywords: 16S; ASV; OTU; feature selection; machine learning; pipeline.


Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

Figure 1
Pipeline process diagram. The input is an OTU/ASV table and the appropriate taxonomy. The features are merged to a given taxonomic level. We tested three possible merging methods: Sum, Average and a PCA on each sub-group of features. Following the merging, we performed either a log scaling or a relative scaling. Following scaling, we performed z scoring on either bacteria or samples or both, and finally, we tested whether performing a dimension reduction on the resulting merged and normalized features improves the accuracy of predictions.
Figure 2
Upper plots: typical ROC curves and the effect of preprocessing. The right plot uses the sub-PCA merging method, while the left plot uses the average merging method. The upper right plot has a higher AUC than the left one. Middle plots: average AUC, defined as the average AUC using one feature (e.g. one taxonomy level) and all other combinations (e.g. merging methods, normalization, etc.). Lower plot: predicted AUC in linear regression vs real linear regression.
Figure 3
Linear regression coefficient for SVM classifier. Coefficients are the contribution of a choice to the total AUC. Each group of coefficients is marked by a different color and normalized to 0. The following two figures follow this figure, but for different classifiers. The regression is over all parameter combinations, including the choice of taxonomy level (red), the grouping method (blue), the dimension reduction method (purple) and the normalization method (green). Since not all normalization and standardization methods are possible, we opened all tested combinations.
Figure 4
Linear regression coefficient for XGBoost classifier.
Figure 5
Linear regression coefficient for MLP classifier.
Figure 6
HFE and MIPMLP mean AUC with standard error bars. Shaded bars show the training set and full bars show the test set. The y axis is the AUC. Different groups of bars correspond to different datasets.

