Metabolites. 2022 Feb 24;12(3):202.
doi: 10.3390/metabo12030202.

A Comprehensive Evaluation of Metabolomics Data Preprocessing Methods for Deep Learning

Krzysztof Jan Abram et al. Metabolites.

Abstract

Machine learning has greatly advanced over the past decade, owing to algorithmic innovations, hardware acceleration, and benchmark datasets for training in domains such as computer vision, natural-language processing, and, more recently, the life sciences. In particular, the subfield of machine learning known as deep learning has found applications in genomics, proteomics, and metabolomics. However, a thorough assessment of how the data preprocessing methods required for the analysis of life science data affect the performance of deep learning is lacking. This work contributes to filling that gap by assessing the impact of commonly used as well as newly developed methods employed in data preprocessing workflows for metabolomics that span from raw data to processed data. The results from these analyses are summarized into a set of best practices that can be used by researchers as a starting point for downstream classification and reconstruction tasks using deep learning.

Keywords: deep learning; metabolomics; preprocessing.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
Data preprocessing workflow for metabolomics. (A) Workflow overview. Red lettering in the workflow diagram indicates steps that can introduce batch effects between metabolomics runs. Black lettering to the right of each workflow box indicates the preprocessing options explored in this study for the corresponding workflow step. (B,C) Datasets 1 and 2 used for the classification task (see Methods Section 3.1). (B) The dataset included 7 different strains of E. coli, where the task was to predict the correct strain from metabolite levels. Importantly, the samples were run twice (once on each of two different instruments), resulting in batch effects between the two runs. (C) The dataset included 5 different single-knockout (KO) strains of E. coli, where the task was to predict the correct knockout from metabolite levels.
Figure 2
Missing value imputation methods. (A) Schematic of 3 replicate samples (Eppendorf tubes) with 4 metabolite data points (circles). Filled circles indicate measured values and open circles indicate missing values. (B) Missing values are usually imputed (grey filled circles) prior to downstream machine-learning tasks. (C) Alternatively, the values for each metabolite from all replicates can be pooled and sampled to generate values for replicates on the fly. The numbers next to each circle indicate the replicate from which the value came. (D) Overview of the procedure for calculating Mass Action Ratios (MARs) for metabolomics data given a metabolic network stoichiometry. Note that the number of features per sample changes from the number of metabolites measured to the number of reactions in the metabolic network. (E) MARs can be calculated by sampling the metabolite values from each replicate.
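The sampling-based imputation in panel (C) and the MAR calculation in panels (D,E) can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: `sample_impute` and `mass_action_ratios` are hypothetical helper names, and the MAR is read here as the product of product concentrations over substrate concentrations, encoded by the signs of the stoichiometric matrix.

```python
import numpy as np

def sample_impute(replicates, rng=None):
    """Fill missing values (NaN) in each replicate by sampling from the
    pooled measured values of the same metabolite across all replicates.

    replicates: array of shape (n_replicates, n_metabolites), NaN = missing.
    Assumes every metabolite has at least one measured value.
    """
    rng = np.random.default_rng() if rng is None else rng
    filled = replicates.copy()
    for j in range(filled.shape[1]):
        col = filled[:, j]
        pool = col[~np.isnan(col)]          # measured values for metabolite j
        missing = np.isnan(col)
        filled[missing, j] = rng.choice(pool, size=missing.sum())
    return filled

def mass_action_ratios(concs, stoich):
    """Compute one MAR per reaction: products over substrates, i.e.
    prod_j concs[j] ** stoich[j, r] (negative coefficients = substrates).

    concs: (n_metabolites,) positive metabolite levels for one sample.
    stoich: (n_metabolites, n_reactions) stoichiometric matrix.
    Note the feature count changes from metabolites to reactions.
    """
    # log-space product: exp(sum_j S[j, r] * log c[j]) for each reaction r
    return np.exp(stoich.T @ np.log(concs))
```

For a single reaction A -> B with concentrations c_A = 2 and c_B = 6, the MAR is 6/2 = 3, matching the products-over-substrates convention above.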
Figure 3
Summary of the classification task and evaluation procedure. (A) Schematic of the machine-learning classification task using metabolite levels as input and sample labels as output. (B) Diagram of the evaluation procedure. Training and test losses are depicted in blue and red, respectively. Summary of classification results for Dataset 1 (C) and Dataset 2 (D). The bars represent the model training iterations to minimum loss, the model minimum loss, and the model maximum accuracy (n = 3 training runs). Error bars represent the standard deviation (n = 3 training runs). Model training iterations to minimum loss and model loss scores are scaled from 0 to 1. The data are sorted in descending order by model accuracy. The top 25 models are shown. Source data are provided in Tables S1–S3. Abbreviations: BN—Biomass normalization; Impute—Imputation method; Trans—Transformation method.
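The evaluation summary described in the caption (scaling iterations-to-minimum-loss and minimum loss to 0–1, then ranking models by accuracy) can be sketched as follows; the function names and result-dictionary keys are assumptions for illustration, not the paper's code.

```python
import numpy as np

def scale01(x):
    """Min-max scale a 1-D sequence to [0, 1]; a constant sequence maps to 0."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

def summarize_models(results):
    """Scale each model's iterations-to-minimum-loss and minimum loss to
    [0, 1] across all models, then sort by accuracy, descending.

    results: list of dicts with keys 'name', 'iters_to_min_loss',
    'min_loss', and 'accuracy' (each averaged over training runs).
    """
    iters = scale01([r["iters_to_min_loss"] for r in results])
    loss = scale01([r["min_loss"] for r in results])
    for r, i, l in zip(results, iters, loss):
        r["iters_scaled"], r["loss_scaled"] = float(i), float(l)
    return sorted(results, key=lambda r: r["accuracy"], reverse=True)
```

The 0–1 scaling puts iterations and loss on a common axis so they can be plotted alongside accuracy in a single bar chart, as in panels (C,D).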
Figure 4
Summary of the reconstruction task and evaluation procedure. (A) Schematic of the machine-learning reconstruction and joint reconstruction and classification tasks using metabolite levels as input and reconstructed metabolite levels (and, for the joint task, sample labels) as output. (B) Diagram of the evaluation procedure. Summary of reconstruction results for Dataset 1 (C,D) and Dataset 2 (E,F). The bars represent the model training iterations to minimum loss, the model minimum loss, and the model metric scores (n = 2–3 training runs). Error bars represent the standard deviation (n = 2–3 training runs). Reconstruction metrics for Pearson's R, Euclidean distance, and Absolute percent difference are shown. Model training iterations to minimum loss, model loss scores, Euclidean distance, and Absolute percent difference are scaled from 0 to 1. The Imputation method of Sampling was used for all models. The data are sorted in ascending order by Euclidean distance (C,E) and Absolute percent difference (D,F). The top 10 models are shown. Source data are provided in Tables S4 and S5. Abbreviations: BN—Biomass normalization; Impute—Imputation method; Trans—Transformation method; Loss—Loss function; MSE—Mean squared error; MAE—Mean absolute error; MAPE—Mean absolute percent error.
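The three reconstruction metrics named in the caption (Pearson's R, Euclidean distance, and absolute percent difference) might be computed per sample roughly as below; this is a plausible reading of the metrics, not the authors' exact definitions, and `reconstruction_metrics` is a hypothetical helper.

```python
import numpy as np

def reconstruction_metrics(x, x_hat):
    """Compare an input metabolite profile x against its reconstruction
    x_hat. Returns (Pearson's R, Euclidean distance, mean absolute
    percent difference). Assumes x has no zero entries for the percent
    difference.
    """
    r = np.corrcoef(x, x_hat)[0, 1]               # Pearson correlation
    euclid = np.linalg.norm(x - x_hat)            # L2 distance
    abs_pct = np.mean(np.abs(x - x_hat) / np.abs(x)) * 100.0
    return r, euclid, abs_pct
```

Pearson's R rewards preserving the shape of the profile even when the scale drifts, while the Euclidean and percent-difference metrics penalize scale errors, which is presumably why the caption reports all three.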
Figure 5
Summary of the disentanglement tasks. (A) Schematic of the machine-learning joint reconstruction task using metabolite levels as input and sample labels and metabolite levels as output. The reconstruction task entailed encoding the input features into a lower-dimensional latent space and then decoding the compressed representation back to the original input features. A more meaningful latent space is one where factors of variation in the input features are disentangled into specific regions of the latent space. (B) Schematic of the direct latent space classification subtask, which entailed capturing the input labels directly in the discrete encodings of the latent space. (C) Schematic of the latent traversal and reconstruction similarity subtask, which entailed traversing the 95% confidence intervals of the continuous encodings and the one-hot vectors of the discrete encodings, then decoding and evaluating the resulting reconstructions against randomly sampled inputs with known labels. Summary of the joint reconstruction and classification task results for Dataset 1 (D) and Dataset 2 (E). Summary of the influence of supervision on the direct latent space classification subtask for Dataset 1 (F) and Dataset 2 (G). Summary of the influence of latent space architecture on reconstruction accuracy for Dataset 1 (H) and Dataset 2 (I). The bars represent the model training iterations to minimum loss, the model minimum loss, and/or the model metric scores as specified in the legend next to each figure panel. Error bars represent the standard deviation. n = 2 for all models shown except for models with an architecture of 1C6D and 1C7D (n = 12) shown in (H,I). Model training iterations to minimum loss, model loss scores, Euclidean distance, and Absolute percent difference are scaled from 0 to 1. Top 6 models sorted by classification accuracy are shown in (D,E); top 10 models sorted by classification accuracy are shown in (F,G); all models sorted by reconstruction distance are shown in (H,I). Biomass normalization of ConcsBN and the Imputation method of Sampling were used for all models. Source data are provided in Tables S6–S9. Abbreviations: Trans—Transformation method; Loss—Loss function; Supervision—Percent supervision used for classification; Architecture—Model architecture, where xC is the number of continuous-valued nodes and xD is the number of discrete-valued nodes in the latent space; MSE—Mean squared error; MAE—Mean absolute error; MAPE—Mean absolute percent error.
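The latent traversal subtask in panel (C) amounts to generating candidate latent codes: sweeping each continuous dimension across its 95% confidence interval while holding the others at their means, and enumerating the one-hot vectors of the discrete code. The sketch below shows only the code generation; the decoder and the comparison against labeled samples are model-specific and omitted, and both function names are hypothetical.

```python
import numpy as np

def continuous_traversal(mu, sigma, n_steps=5):
    """Sweep each continuous latent dimension across its 95% confidence
    interval (mu +/- 1.96*sigma), holding the other dimensions at mu.
    Returns an array of shape (n_dims * n_steps, n_dims) of latent codes.
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    codes = []
    for d in range(len(mu)):
        lo, hi = mu[d] - 1.96 * sigma[d], mu[d] + 1.96 * sigma[d]
        for v in np.linspace(lo, hi, n_steps):
            z = mu.copy()
            z[d] = v
            codes.append(z)
    return np.array(codes)

def discrete_traversal(n_discrete):
    """Enumerate the one-hot vectors of a discrete latent code."""
    return np.eye(n_discrete)
```

Each generated code would then be passed through the decoder, and the reconstruction compared against randomly sampled inputs with known labels, as the caption describes.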

