Sci Rep. 2020 Oct 19;10(1):17635.
doi: 10.1038/s41598-020-74823-1.

Predicting human health from biofluid-based metabolomics using machine learning

Ethan D Evans et al. Sci Rep. 2020.

Abstract

Biofluid-based metabolomics has the potential to provide highly accurate, minimally invasive diagnostics. Metabolomics studies using mass spectrometry typically reduce the high-dimensional data to a small number of statistically significant features, which are often chemically identified; each feature corresponds to a mass-to-charge ratio, retention time, and intensity. This practice may remove a substantial amount of predictive signal. To test the utility of the complete feature set, we train machine learning models for health state prediction in 35 human metabolomics studies, representing 148 individual data sets. Models trained with all features outperform those using only significant features and frequently provide high predictive performance across nine health state categories, despite disparate experimental and disease contexts. Even when only non-significant features are used, it is often still possible to train models that achieve high predictive performance, suggesting that these features carry useful predictive signal. This work highlights the potential for health state diagnostics using all metabolomics features with data-driven analysis.

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
Body fluid-based metabolomics often possesses health state-dependent signal and diagnostic capability. (A) Studies analyzed and their associated cohort sizes (control/case sizes to the right of the bar plot), separated by health state category. Controls are shown in black and cases in color. Multiclass control bars correspond to the size of the first class for 3-class studies, or the first 2 classes of 4- or 5-class studies, with the case bar representing the remaining samples. Multiclass studies are: D1 (Type 2 diabetes, prediabetic, healthy), H1 (never smokers, former smokers, smokers, COPD patients), D3 (Type 1 diabetes with insulin injection, Type 1 diabetes with insulin withdrawal, no diabetes), A1 (Alzheimer's disease, mild cognitive impairment, normal), B5 (colorectal cancer patients, polyp patients, healthy controls), G1 (minimal change disease, focal segmental glomerulosclerosis, control), I2 (normal, pulmonary artery hypertension, low risk, healthy, borderline pressure). (B) Averaged ROC-AUC and standard deviation for 30 L1-regularized logistic regression (L1-LR) models, each trained and tested on a different randomized, stratified shuffle of the within-study combined data sets. *Multiclass studies for which one-vs-one models were built. ^Studies for which it was not possible to combine data sets.
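The evaluation scheme in panel (B) can be illustrated with a minimal sketch, assuming scikit-learn is available. The synthetic data, the 80/20 split, the standardization step, and the regularization strength `C` are assumptions made for illustration; the caption specifies only 30 L1-LR models on randomized, stratified shuffles.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def shuffled_auc(X, y, n_shuffles=30, test_size=0.2, C=1.0, seed=0):
    """Mean and standard deviation of test ROC-AUC over stratified shuffles.

    test_size and C are illustrative assumptions, not values from the paper.
    """
    splitter = StratifiedShuffleSplit(
        n_splits=n_shuffles, test_size=test_size, random_state=seed
    )
    aucs = []
    for train_idx, test_idx in splitter.split(X, y):
        # A fresh model per shuffle; scaling is fitted on the training part only.
        model = make_pipeline(
            StandardScaler(),
            LogisticRegression(penalty="l1", solver="liblinear", C=C),
        )
        model.fit(X[train_idx], y[train_idx])
        scores = model.predict_proba(X[test_idx])[:, 1]
        aucs.append(roc_auc_score(y[test_idx], scores))
    return float(np.mean(aucs)), float(np.std(aucs)), aucs


# Synthetic stand-in for a samples-by-features intensity matrix (hypothetical data).
X, y = make_classification(
    n_samples=120, n_features=300, n_informative=10, random_state=0
)
mean_auc, std_auc, aucs = shuffled_auc(X, y)
print(f"ROC-AUC over {len(aucs)} shuffles: {mean_auc:.3f} +/- {std_auc:.3f}")
```

Stratification keeps the case/control ratio the same in every train and test split, which matters for the small and imbalanced cohorts shown in panel (A).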
Figure 2
Health state information is found in all body fluids using different instruments, mass spectrometry ion modes, and chromatographic methods. (A) Ion mode, sample, and column type (for liquid chromatography) used for each individual data set. (B) Individual data set ROC-AUC and standard deviation analysis of 30 averaged L1-LR models, each trained and tested on different randomized, stratified shuffles, with associated sample and instrument type. (C) Violin plots for the comparison of non-multiclass, liquid chromatography to gas chromatography data sets (left), positive versus negative ion mode for all liquid chromatography data sets (middle), and C18 versus hydrophilic interaction chromatography (HILIC) columns (right), using AUC values from models trained on individual data sets, using all features.
Figure 3
Using all features generally leads to the best model performance. (A) Comparison of AUC values from L1-LR models built from within-study combined data sets versus the average AUC of independent models built on non-combined data sets. Also shown are the average difference between combined and non-combined model AUCs and the P-value for the comparison (Mann-Whitney U test, MW-U). (B, top) Comparison of model performance on individual data sets with at least one statistically significant feature. Circle size is proportional to the number of features. The purple outlier point with better AUC using only significant features is from a HILIC, CSF sample in positive ionization from study A1. (B, bottom, with shared y-axis label) Comparison of performance between models trained using all features and ‘significant feature only’ models for data sets with no significant features, with an accompanying violin plot of the AUC distribution of models built using all features. Delta AUC values were determined by subtracting the AUC of a model built with only significant features from that of a model trained on all features, and then averaged over all data sets. (C) Comparison of models built using up to 5, 10, 50 and 100 of the most significant features (lowest Q-value), or all significant features, versus models trained using all features; results are displayed only for data sets that possessed significant features. Purple outlines indicate data sets with fewer than the cutoff number of significant features, and the fraction at the bottom depicts the number of data sets meeting the cutoff. P-values correspond to MW-U tests between the AUC values from models using all features and the AUC values from models built with a select number of significant features. The health state color legend for (A) and (B) is shown in the bottom right.
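The "up to k most significant features" selection in panel (C) can be sketched as follows. This is a minimal illustration that assumes the Q-values come from the Benjamini-Hochberg procedure; the captions say only "FDR-corrected", so that choice, the function names, and the example P-values are all hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg q-values in the original feature order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is the minimum of
    # p * m / rank over its own rank and all larger ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


def top_significant(q_values, k, alpha=0.05):
    """Indices of up to k features with the lowest q-values below alpha."""
    significant = [i for i, q in enumerate(q_values) if q < alpha]
    significant.sort(key=lambda i: q_values[i])
    return significant[:k]


# Hypothetical per-feature P-values, e.g. from a case-versus-control MW-U test.
p = [0.001, 0.2, 0.012, 0.04, 0.8, 0.003]
q = benjamini_hochberg(p)

print(top_significant(q, 2))  # [0, 5]
print(top_significant(q, 5))  # [0, 5, 2]: fewer than the cutoff are significant
```

The second call mirrors the purple-outlined case in the caption: when a data set has fewer significant features than the cutoff, the model simply receives all of them.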
Figure 4
Machine learning models are relatively sparse and frequently use features spanning a large mass range and both significance types. (A) Fraction of non-zero feature coefficients in the models corresponding to statistically significant features (p < 0.05 FDR-corrected, MW-U test) for single models trained on individual data sets. (B) Number of input features (colored points) relative to the number of non-zero model features (vertical dashes) for a single, representative model training for each data set. (C) Representative plots of the features and associated average model coefficients (30 model trainings) used across the range of observed mass-to-charge ratios. Significant versus non-significant features are depicted in different colors (red, q < 0.05; black, q ≥ 0.05). Data sets (clockwise, starting from the upper left): breast cancer (B1), lung cancer (B14), hepatocellular carcinoma (B7), hepatocellular carcinoma (B8), diabetes (D1), Alzheimer’s disease (A2), coronary heart disease (C1), age (F1), smoking versus non-smoking (F4).
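The quantities in panels (A) and (B) reduce to simple counts over a fitted model's coefficients. The sketch below assumes a coefficient vector and per-feature q-values are already available; the function name, the example values, and the 0.05 threshold as a default argument are illustrative.

```python
def coefficient_summary(coefs, q_values, alpha=0.05):
    """Summarize model sparsity and how many retained features are significant."""
    nonzero = [i for i, c in enumerate(coefs) if c != 0.0]
    significant_nonzero = [i for i in nonzero if q_values[i] < alpha]
    return {
        "n_input": len(coefs),          # features offered to the model
        "n_nonzero": len(nonzero),      # features the L1 penalty kept
        # Share of kept features that are statistically significant;
        # None when the model kept no features at all.
        "fraction_significant": (
            len(significant_nonzero) / len(nonzero) if nonzero else None
        ),
    }


# Hypothetical coefficients from one L1-LR fit and matching q-values.
coefs = [0.0, 1.2, 0.0, -0.4, 0.0, 0.7, 0.0, 0.0]
q_vals = [0.5, 0.01, 0.03, 0.2, 0.9, 0.04, 0.6, 0.001]

summary = coefficient_summary(coefs, q_vals)
print(summary)
```

In this toy case the model keeps three of eight features, and two of the three are significant; one significant feature is dropped and one non-significant feature is retained, which is the kind of mixing the caption describes.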
Figure 5
Models trained with only non-significant features often retain relatively high diagnostic performance. (A) AUC comparison between models trained using all features and those using no statistically significant features for data sets with at least one significant feature. Circle size is proportional to the log of the number of features in the data set. (B) Fraction of non-significant features for each high resolution mass spectrometry data set that can be explained as adducts or isotopes of statistically significant features. Numbers on the right are the average number of features across a study’s data sets, with standard deviation. (C) AUC of models trained and tested using only non-significant features versus the fraction of non-significant features explained by adducts and isotopes in the input data set. (D) AUC comparison between models trained using only non-significant features versus non-significant features without adducts or isotopes of significant features. Circle size is proportional to the log of the number of features in data sets from which significant features, their adducts, and isotopes have been removed.
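One way to estimate the fraction in panel (B) is to check whether a non-significant feature co-elutes with a significant feature and sits at a known mass offset from it. The sketch below is an assumption about how such matching could work, not the authors' procedure: the `Feature` class, the two mass shifts considered, the 10 ppm tolerance, and the 5 s retention-time window are all chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    mz: float  # mass-to-charge ratio
    rt: float  # retention time in seconds


# Mass offsets (Da) from a singly charged parent ion; a real analysis
# would consider many more adducts and isotopes than these two.
MASS_SHIFTS = {
    "13C isotope": 1.003355,
    "[M+Na]+ relative to [M+H]+": 21.981944,
}


def explained_by(feature, significant, ppm=10.0, rt_tol=5.0):
    """Return the matching shift label, or None if no parent explains the feature."""
    for parent in significant:
        if abs(feature.rt - parent.rt) > rt_tol:
            continue  # adducts and isotopes should co-elute with the parent
        for label, shift in MASS_SHIFTS.items():
            expected = parent.mz + shift
            if abs(feature.mz - expected) / expected * 1e6 <= ppm:
                return label
    return None


def fraction_explained(non_significant, significant, **kwargs):
    """Fraction of non-significant features matched to a significant parent."""
    if not non_significant:
        return 0.0
    hits = sum(
        explained_by(f, significant, **kwargs) is not None for f in non_significant
    )
    return hits / len(non_significant)


# Hypothetical features: one significant parent and four non-significant ones.
significant = [Feature(mz=181.0707, rt=120.0)]
non_significant = [
    Feature(mz=182.0741, rt=120.5),  # 13C isotope of the parent
    Feature(mz=203.0526, rt=119.0),  # sodium adduct of the parent
    Feature(mz=182.0741, rt=300.0),  # right mass, wrong retention time
    Feature(mz=250.1234, rt=120.0),  # unrelated mass
]

fraction = fraction_explained(non_significant, significant)
print(fraction)  # 0.5
```

A parts-per-million tolerance only makes sense for high resolution data, which is consistent with the caption restricting this panel to high resolution mass spectrometry data sets.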
