Review
2022 Nov 24:13:1017340.
doi: 10.3389/fgene.2022.1017340. eCollection 2022.

Applications of machine learning in metabolomics: Disease modeling and classification

Aya Galal et al. Front Genet.

Abstract

Metabolomics research has recently gained popularity because it enables the study of biological traits at the biochemical level and, as a result, can directly reveal what occurs in a cell or a tissue based on health or disease status, complementing other omics such as genomics and transcriptomics. Like other high-throughput biological experiments, metabolomics produces vast volumes of complex data. The application of machine learning (ML) to analyze data, recognize patterns, and build models is expanding across multiple fields. Likewise, ML methods are used for the classification, regression, or clustering of highly complex metabolomic data. This review discusses how disease modeling and diagnosis can be enhanced via deep and comprehensive metabolomic profiling using ML. We discuss the general layout of a metabolomics workflow and the fundamental ML techniques used to analyze metabolomic data, including support vector machines (SVM), decision trees, random forests (RF), neural networks (NN), and deep learning (DL). Finally, we present the advantages and disadvantages of various ML methods and provide suggestions for different metabolomic data analysis scenarios.

Keywords: biomarkers; deep learning; machine learning; metabolic disorders; metabolomics.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

FIGURE 1
Principles of metabolomics experimental design and the associated ML workflow. The left panel shows the various sources of metabolites. Exposure to metabolites can occur through endogenous and exogenous sources, e.g., human-encoded, microbiome-encoded, food, drugs, and toxins. Metabolic dysbiosis can be associated with metabolic disorders, e.g., cancer, cardiovascular disease, intestinal disorders, and diabetes. The center panel shows the typical design and flow of a metabolomic experiment: 1) study design, where disease and control groups are determined; 2) sample selection, e.g., urine, stool, blood, and serum; 3) pre-treatment and processing of the collected samples according to the experimental design; 4) data acquisition, e.g., through mass spectrometry or NMR; 5) feature selection, i.e., the identification of the metabolite features of interest; 6) data processing through the quantification of metabolites; and 7) data analysis, which depends on the study design. The right panel shows the ML workflow and prediction: 1) data wrangling and cleaning; 2) matrix construction, where data for each metabolite are placed in a matrix in reference to the conditions, i.e., disease (marked in red) and control (marked in blue); 3) division of the data into training, validation, and testing datasets; 4) application of the ML algorithm; and 5) cross-validation and testing of the predictive power of the algorithm on a test dataset. Created with BioRender.com.
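The matrix construction and data division steps in the right panel can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `split_samples`, the 60/20/20 proportions, and the toy matrix are all assumptions made for illustration. The split is stratified so that disease and control samples appear in every subset.

```python
import random

def split_samples(labels, train_frac=0.6, val_frac=0.2, seed=0):
    """Return (train, validation, test) index lists, stratified by class label.

    The fractions are illustrative assumptions, not values from the review.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * train_frac)
        n_val = round(len(indices) * val_frac)
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # whatever remains is held out
    return train, val, test

# Hypothetical matrix: one row per sample, one column per metabolite intensity.
matrix = [[float(i + j) for j in range(3)] for i in range(20)]
labels = ["disease"] * 10 + ["control"] * 10

train_idx, val_idx, test_idx = split_samples(labels)
train_rows = [matrix[i] for i in train_idx]
```

The test subset is set aside until the end, so the final estimate of predictive power comes from samples the model never saw during training or tuning.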
FIGURE 2
Metabolomic publications using machine learning in data analytics over the past two decades. PubMed was searched using the keywords “metabolomics” and “machine learning” from 2002 to 2022. Results were manually filtered to remove review articles and irrelevant publications. The counted publications include studies that use any of the mentioned ML algorithms in the context of metabolomic analysis, including classification problems, biomarker discovery, peak identification, metabolomic data analysis tools, and others. Only ML algorithms employed for disease model building are considered. (A) The total number of publications per year. (B) The number of publications using ML methods per year. The y-axes in (A) and (B) differ because (B) covers only the ML methods discussed in this review. The totals in panels (A) and (B) differ because publications often use multiple ML algorithms.
FIGURE 3
Categories of machine learning algorithms. ML algorithms are divided into four main classes: supervised, unsupervised, semi-supervised, and reinforcement learning. The choice of category depends on the type and nature of the data under investigation, i.e., labeled or unlabeled data.
FIGURE 4
The most commonly used ML algorithms, grouped by function, with a graphical representation of each algorithm and some potential applications. The most frequently used algorithms can be grouped into regression (linear and logistic), clustering (k-means, k-NN, hierarchical clustering, NN), and classification (Naive Bayes, SVM, decision trees). Created with BioRender.com.
FIGURE 5
Support vector machines (SVM) construct a hyperplane to separate data into two classes. Axes represent different features. Green triangles and blue circles represent different conditions (e.g., disease vs. control). The margin (red dotted line) is the distance between the hyperplane and the support vectors (the nearest data points of each class).
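The geometry in this caption can be made concrete with a small sketch. The hyperplane below (weights `w`, offset `b`) and the two-feature toy points are chosen by hand for illustration; nothing here is trained, and none of the values come from the review. The code classifies each point by the sign of its decision value and reports the margin as the distance from the hyperplane to the nearest point.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the predicted side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def distance_to_hyperplane(w, b, x):
    """Euclidean distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(decision_value(w, b, x)) / norm

# Hand-picked hyperplane and toy samples (assumptions for illustration only).
w, b = [1.0, 1.0], -3.0
disease = [[2.0, 2.0], [4.0, 3.0]]
control = [[1.0, 1.0], [0.0, 1.0]]

predictions = [
    "disease" if decision_value(w, b, x) > 0 else "control"
    for x in disease + control
]

# The margin is set by the closest points, which play the role of support vectors.
margin = min(distance_to_hyperplane(w, b, x) for x in disease + control)
```

In this toy set the points `[2.0, 2.0]` and `[1.0, 1.0]` lie closest to the hyperplane, one on each side and at equal distance, so they determine the margin.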
FIGURE 6
The “kernel trick” - non-linearly separable data points are mapped into a higher dimensional feature space in which they become linearly separable. Axes represent different features. Green triangles and blue circles represent different conditions (e.g., disease vs. control). The hyperplane, in this case, becomes a two-dimensional plane.
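A minimal sketch of this idea, assuming a degree-2 polynomial kernel and hand-made toy points (neither is taken from the review): points near the origin and points far from it cannot be split by a line in two dimensions, but after the explicit mapping `phi` into three dimensions a plane separates them. The kernel function returns the same inner product as the mapped vectors without ever building them, which is what makes the trick cheap.

```python
import math

def phi(x):
    """Explicit degree-2 feature map from 2-D input to 3-D feature space."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def poly_kernel(x, z):
    """Degree-2 polynomial kernel; equals dot(phi(x), phi(z))."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def plane_score(features):
    """Hand-chosen plane z1 + z3 - 1 = 0 in feature space (an assumption)."""
    return features[0] + features[2] - 1.0

# Toy data: one class clustered near the origin, the other surrounding it.
inner = [(0.5, 0.0), (0.0, -0.5), (-0.3, 0.3)]
outer = [(2.0, 0.0), (0.0, -2.0), (-1.5, 1.5)]

inner_scores = [plane_score(phi(x)) for x in inner]  # all negative
outer_scores = [plane_score(phi(x)) for x in outer]  # all positive
```

The plane here is picked by hand to keep the sketch short; an SVM would instead learn the separating plane from the kernel values alone.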
FIGURE 7
Basic neural network architecture. Circles represent neurons. w1, w2, and w3 represent weights by which values calculated inside neurons are multiplied before being passed on to the next layer. In the hidden layer neurons, values are passed into an activation function (e.g., the ReLU function), while the output layer neuron applies a classifier function (e.g., the Softmax function) to input values.
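The forward pass described in this caption can be sketched in a few lines. The layer sizes, weights, and input below are arbitrary assumptions for illustration, not values from the review: inputs are multiplied by weights, hidden neurons apply ReLU, and the output layer applies softmax to produce class probabilities.

```python
import math

def relu(values):
    """ReLU activation: negative values become zero."""
    return [max(0.0, v) for v in values]

def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(values)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dense(inputs, weights, biases):
    """One fully connected layer: weighted sum plus bias for each neuron."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]

def forward(x, params):
    hidden = relu(dense(x, params["W1"], params["b1"]))
    return softmax(dense(hidden, params["W2"], params["b2"]))

# Hypothetical parameters: 2 input features, 2 hidden neurons, 2 output classes.
params = {
    "W1": [[1.0, -1.0], [0.5, 0.5]], "b1": [0.0, 0.0],
    "W2": [[1.0, 0.0], [0.0, 1.0]], "b2": [0.0, 0.0],
}
probabilities = forward([2.0, 1.0], params)
```

The index of the largest entry in `probabilities` is the predicted class.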
FIGURE 8
Gradient descent: the initial network parameters (weights and biases) are adjusted in a direction that travels down the slope of the cost function (green curve) until the minimum is reached.
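As a minimal sketch of this update rule, the example below uses a single parameter and a quadratic cost with its minimum at 3. The cost function, the learning rate, and the number of steps are assumptions chosen for illustration; a real network applies the same rule to every weight and bias.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient; return the final value and the path."""
    w = start
    history = [w]
    for _ in range(steps):
        w = w - learning_rate * grad(w)  # move down the slope of the cost
        history.append(w)
    return w, history

def cost(w):
    """Toy cost function with its minimum at w = 3."""
    return (w - 3.0) ** 2

def cost_grad(w):
    """Derivative of the toy cost with respect to w."""
    return 2.0 * (w - 3.0)

w_final, path = gradient_descent(cost_grad, start=0.0)
```

With this learning rate each step shrinks the distance to the minimum by a constant factor, so the cost falls steadily along `path`. A much larger rate would overshoot the minimum instead.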

