Review
Front Microbiol. 2023 Oct 5:14:1250909.
doi: 10.3389/fmicb.2023.1250909. eCollection 2023.

Overview of data preprocessing for machine learning applications in human microbiome research

Eliana Ibrahimi et al. Front Microbiol.

Abstract

Although metagenomic sequencing is now the preferred technique for studying microbiome-host interactions, analyzing and interpreting microbiome sequencing data remains challenging, primarily because of the statistical characteristics of the data (e.g., sparsity, over-dispersion, compositionality, and inter-variable dependency). This mini review explores the preprocessing and transformation methods applied in recent human microbiome studies to address these challenges. Our results indicate limited adoption of transformation methods that target the statistical characteristics of microbiome sequencing data. Instead, relative-abundance and normalization-based transformations, which do not account for the specific attributes of microbiome data, are prevalent. Information on the preprocessing and transformations applied to the data before analysis was incomplete or missing in many publications, raising concerns about reproducibility, comparability, and the validity of results. We hope this mini review will provide researchers and newcomers to human microbiome research with an up-to-date point of reference for data transformation tools and help them choose the most suitable transformation method based on their research questions, objectives, and data characteristics.

Keywords: compositionality; data preprocessing; human microbiome; machine learning; metagenomics data; normalization.
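
To make the contrast in the abstract concrete, the sketch below compares a relative-abundance (total-sum) transformation with the centered log-ratio (CLR) transformation, a commonly used compositional method. It is an illustration only and is not taken from the review: the function names, the example counts, and the pseudocount of 0.5 used to handle zero counts are all assumptions.

```python
import math


def relative_abundance(counts):
    """Total-sum scaling: divide each count by the sample's total reads."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no reads")
    return [c / total for c in counts]


def clr(counts, pseudocount=0.5):
    """Centered log-ratio: log of each part minus the mean log over all parts.

    The pseudocount is an illustrative way to avoid log(0) in sparse data;
    other zero-handling strategies exist and may be preferable.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [x - mean_log for x in logs]


# Hypothetical read counts for four taxa in one sample.
sample = [120, 0, 30, 850]

ra = relative_abundance(sample)   # values sum to 1 (closed composition)
clr_values = clr(sample)          # values sum to 0 (centered on the log scale)
```

Relative abundances remain constrained to sum to one, so an increase in one taxon forces an apparent decrease in the others. CLR values are log-ratios against the sample's geometric mean, which is why they are often preferred as input to methods that assume unconstrained real-valued features.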

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

Figure 1
TreeMap chart illustrating the percentage of reviewed papers that applied normalization-based or compositional transformation methods, as well as the papers without clear information on preprocessing or data transformation. The other-normalization category comprises inverse-rank normalization, Box-Cox transformation, rarefaction, minimum-maximum transformation, scaling by standard deviation, normalization by total read depth, etc.

