Review

Front Genet. 2022 Nov 24;13:1017340.

doi: 10.3389/fgene.2022.1017340. eCollection 2022.

Applications of machine learning in metabolomics: Disease modeling and classification

Aya Galal¹,², Marwa Talal¹,³, Ahmed Moustafa¹,³,⁴

Affiliations

¹ Systems Genomics Laboratory, American University in Cairo, New Cairo, Egypt.
² Institute of Global Health and Human Ecology, American University in Cairo, New Cairo, Egypt.
³ Biotechnology Graduate Program, American University in Cairo, New Cairo, Egypt.
⁴ Department of Biology, American University in Cairo, New Cairo, Egypt.

PMID: 36506316
PMCID: PMC9730048
DOI: 10.3389/fgene.2022.1017340

Aya Galal et al. Front Genet. 2022.

Abstract

Metabolomics research has recently gained popularity because it enables the study of biological traits at the biochemical level and, as a result, can directly reveal what occurs in a cell or a tissue based on health or disease status, complementing other omics such as genomics and transcriptomics. Like other high-throughput biological experiments, metabolomics produces vast volumes of complex data. The application of machine learning (ML) to analyze data, recognize patterns, and build models is expanding across multiple fields. In the same way, ML methods are utilized for the classification, regression, or clustering of highly complex metabolomic data. This review discusses how disease modeling and diagnosis can be enhanced via deep and comprehensive metabolomic profiling using ML. We discuss the general layout of a metabolic workflow and the fundamental ML techniques used to analyze metabolomic data, including support vector machines (SVM), decision trees, random forests (RF), neural networks (NN), and deep learning (DL). Finally, we present the advantages and disadvantages of various ML methods and provide suggestions for different metabolic data analysis scenarios.

Keywords: biomarkers; deep learning; machine learning; metabolic disorders; metabolomics.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

**FIGURE 1**
Principles of metabolomics experimental design and the associated ML workflow. The left panel describes the various sources of metabolites. Metabolite exposure can be endogenous or exogenous, e.g., human-encoded, microbiome-encoded, food, drugs, and toxins. Metabolic dysbiosis can be associated with metabolic disorders, e.g., cancer, cardiovascular disease, intestinal disorders, and diabetes. The center panel describes the typical flow and design of a metabolomic experiment: 1) study design, where disease and control groups are determined; 2) sample selection, e.g., urine, stool, blood, and serum; 3) pre-treatment and processing of the collected samples according to the experimental design; 4) data acquisition, e.g., through mass spectrometry or NMR; 5) feature selection, which identifies the metabolite features of interest; 6) data processing through the quantification of those metabolites; and 7) data analysis, which depends on the study design. The right panel describes the ML workflow and prediction: 1) data wrangling and cleaning; 2) matrix construction, where data for each metabolite are placed in a matrix in reference to the conditions, i.e., disease (marked in red) and control (marked in blue); 3) division of the data into training, validation, and testing datasets; 4) application of the ML algorithm; and 5) cross-validation and testing of the algorithm's predictive power on a test dataset. Created with BioRender.com.
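
The division of data into training, validation, and testing datasets in the right panel can be sketched as follows. This is only an illustration under stated assumptions: the function name `split_samples`, the 70/15/15 proportions, the fixed seed, and the sample IDs are hypothetical and are not taken from the review.

```python
import random

def split_samples(sample_ids, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle sample IDs and split them into training, validation, and test sets.

    The fractions are illustrative assumptions, not values from the review.
    """
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    n_train = round(len(ids) * train_frac)
    n_val = round(len(ids) * val_frac)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]      # whatever remains is held out for testing
    return train, val, test

# Hypothetical sample identifiers standing in for rows of the metabolite matrix.
samples = [f"S{i:02d}" for i in range(20)]
train, val, test = split_samples(samples)
print(len(train), len(val), len(test))
```

Keeping the test set untouched until the end is what allows it to estimate predictive power on unseen samples.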

**FIGURE 2**
Metabolomic publications using machine learning in data analytics over the past two decades. PubMed was searched using the keywords “metabolomics” and “machine learning” from 2002 to 2022. Results were manually filtered to remove review articles and irrelevant publications. The counted publications include studies that use any of the mentioned ML algorithms in the context of metabolomic analysis, including classification problems, biomarker discovery, peak identification, metabolomic data analysis tools, and others. Only ML algorithms employed for disease model building are considered. **(A)** The total number of publications per year. **(B)** The number of publications using ML methods per year. The y-axis scales in **(A)** and **(B)** differ because **(B)** counts only the ML methods discussed in this review. The total number of publications differs between panels **(A)** and **(B)** because publications often use multiple ML algorithms.

**FIGURE 3**
Machine learning algorithm categories. ML algorithms are divided into four main classes: Supervised, Unsupervised, Semi-supervised, and Reinforcement learning. The choice of category depends on the type and nature of the data under investigation, i.e., labeled or unlabeled data.

**FIGURE 4**
Representation of the most commonly used ML algorithms, grouped by function and accompanied by graphical representations of each algorithm and some potential applications. The most frequently used algorithms can be grouped into regression (linear and logistic), clustering (k-means, k-NN, hierarchical clustering, NN), and classification (Naive Bayes, SVM, Decision trees). Created with BioRender.com.

**FIGURE 5**
Support Vector Machines (SVM) construct a hyperplane to separate data into two classes. Axes represent different features. Green triangles and blue circles represent different conditions (e.g., disease vs. control). The margin (red dotted line) is the distance between the hyperplane and the support vectors (the nearest data point of each class).
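
The margin described in this caption can be computed directly once a hyperplane is given. The sketch below does not train an SVM; it assumes a hand-picked hyperplane and toy two-feature samples (all values hypothetical) and only measures each sample's distance to that hyperplane.

```python
import math

def signed_distance(point, w, b):
    """Signed distance from a point to the hyperplane w.x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, point))
    return (dot + b) / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane x1 + x2 - 5 = 0 and toy samples with two features each.
w, b = (1.0, 1.0), -5.0
disease = [(4.0, 4.0), (5.0, 3.0), (3.0, 3.0)]
control = [(1.0, 1.0), (2.0, 2.0), (0.0, 3.0)]

# The sign of the distance gives the predicted class (side of the hyperplane).
disease_side = [signed_distance(p, w, b) > 0 for p in disease]
control_side = [signed_distance(p, w, b) < 0 for p in control]

# The support vectors are the samples closest to the hyperplane;
# their distance is the margin.
margin = min(abs(signed_distance(p, w, b)) for p in disease + control)
print(margin)
```

An actual SVM solver would search for the `w` and `b` that make this margin as large as possible.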

**FIGURE 6**
The “kernel trick” - non-linearly separable data points are mapped into a higher dimensional feature space in which they become linearly separable. Axes represent different features. Green triangles and blue circles represent different conditions (e.g., disease vs. control). The hyperplane, in this case, becomes a two-dimensional plane.
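
A small sketch of the mapping idea, assuming toy data in which one class sits near the origin and the other surrounds it. For clarity it uses an explicit feature map (adding x1² + x2² as a third feature) rather than a kernel function, which would achieve the same effect without ever building the higher-dimensional points. The points and the threshold are hypothetical.

```python
def lift(point):
    """Map a 2-D point (x1, x2) to 3-D by adding x1^2 + x2^2 as a third feature."""
    x1, x2 = point
    return (x1, x2, x1 * x1 + x2 * x2)

# Toy data: no straight line in 2-D separates the inner points from the outer ones.
inner = [(0.5, 0.0), (0.0, -0.5), (-0.3, 0.4)]   # e.g., control
outer = [(2.0, 0.0), (0.0, -2.0), (-1.5, 1.5)]   # e.g., disease

# In the lifted space, the flat plane z = 1 (a hypothetical choice) separates them.
threshold = 1.0
inner_below = all(lift(p)[2] < threshold for p in inner)
outer_above = all(lift(p)[2] > threshold for p in outer)
print(inner_below, outer_above)
```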

**FIGURE 7**
Basic neural network architecture. Circles represent neurons. w₁, w₂, and w₃ represent weights by which values calculated inside neurons are multiplied before being passed on to the next layer. In the hidden layer neurons, values are passed into an activation function (e.g., the ReLU function), while the output layer neuron applies a classifier function (e.g., the Softmax function) to input values.
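
The forward pass described in this caption can be sketched in a few lines. The layer sizes, weights, biases, and input values below are hypothetical and chosen only for illustration; the figure itself does not specify them.

```python
import math

def relu(z):
    """ReLU activation: pass positive values through, clamp negatives to zero."""
    return max(0.0, z)

def softmax(scores):
    """Convert raw output scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dense(inputs, weights, biases):
    """Weighted sum per neuron; weights[j][i] links input i to neuron j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def forward(features, hidden_w, hidden_b, out_w, out_b):
    """One hidden layer with ReLU, then a softmax output layer."""
    hidden = [relu(z) for z in dense(features, hidden_w, hidden_b)]
    return softmax(dense(hidden, out_w, out_b))

# Hypothetical values: three metabolite intensities, two hidden neurons, two classes.
features = [0.2, 1.5, 0.7]
hidden_w = [[0.4, -0.6, 0.1], [0.3, 0.8, -0.5]]
hidden_b = [0.0, 0.1]
out_w = [[1.0, -1.0], [-1.0, 1.0]]
out_b = [0.0, 0.0]
probs = forward(features, hidden_w, hidden_b, out_w, out_b)
print(probs)
```

The class with the largest probability in `probs` would be the network's prediction.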

**FIGURE 8**
Gradient descent; initial network parameters (weights and biases) are adjusted in a direction that travels down the slope of the cost function (green curve) until the minimum is reached.
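
A minimal sketch of this update rule, assuming a single parameter and a hypothetical quadratic cost function with its minimum at w = 3 (a real network adjusts many weights and biases at once). The learning rate and step count are also illustrative assumptions.

```python
def cost(w):
    """Hypothetical cost function with its minimum at w = 3."""
    return (w - 3.0) ** 2

def gradient(w):
    """Slope of the cost function at w."""
    return 2.0 * (w - 3.0)

w = 0.0               # initial parameter value
learning_rate = 0.1   # step size along the negative slope
for _ in range(100):
    w -= learning_rate * gradient(w)  # move downhill on the cost curve

print(w, cost(w))
```

Each step shrinks the distance to the minimum by a constant factor here, so the parameter settles very close to 3.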

Cited by

Multi-Omics Analysis Reveals the Toxicity of Polyvinyl Chloride Microplastics toward BEAS-2B Cells.
Liu C, Chen S, Chu J, Yang Y, Yuan B, Zhang H. Toxics. 2024 May 30;12(6):399. doi: 10.3390/toxics12060399. PMID: 38922079. Free PMC article.
Discovery of urinary biosignatures for tuberculosis and nontuberculous mycobacteria classification using metabolomics and machine learning.
Anh NK, Phat NK, Thu NQ, Tien NTN, Eunsu C, Kim HS, Nguyen DN, Kim DH, Long NP, Oh JY. Sci Rep. 2024 Jul 3;14(1):15312. doi: 10.1038/s41598-024-66113-x. PMID: 38961191. Free PMC article.
Early detection of feline chronic kidney disease via 3-hydroxykynurenine and machine learning.
Vanden Broecke E, Van Mulders L, De Paepe E, Paepe D, Daminet S, Vanhaecke L. Sci Rep. 2025 Feb 26;15(1):6875. doi: 10.1038/s41598-025-90019-x. PMID: 40011503. Free PMC article.
Survival analysis of patient groups defined by unsupervised machine learning clustering methods based on patient metabolomic data.
Bailleux C, Chardin D, Guigonis JM, Ferrero JM, Chateau Y, Humbert O, Pourcher T, Gal J. Comput Struct Biotechnol J. 2023 Oct 19;21:5136-5143. doi: 10.1016/j.csbj.2023.10.033. eCollection 2023. PMID: 37920813. Free PMC article.
A Comprehensive Machine Learning Approach for COVID-19 Target Discovery in the Small-Molecule Metabolome.
Sumon MSI, Hossain MSA, Al-Sulaiti H, Yassine HM, Chowdhury MEH. Metabolites. 2025 Jan 11;15(1):44. doi: 10.3390/metabo15010044. PMID: 39852387. Free PMC article.
