Metabolomics. 2022 Jul 20;18(8):57.
doi: 10.1007/s11306-022-01918-3.

Lung cancer survival prediction and biomarker identification with an ensemble machine learning analysis of tumor core biopsy metabolomic data

Hunter A Miller et al. Metabolomics.

Abstract

Introduction: While prediction of short versus long term survival from lung cancer is clinically relevant in the context of patient management and therapy selection, it has proven difficult to identify reliable biomarkers of survival. Metabolomic markers from tumor core biopsies have been shown to reflect cancer metabolic dysregulation and hold prognostic value.

Objectives: Implement and validate a novel ensemble machine learning approach to evaluate survival based on metabolomic biomarkers from tumor core biopsies.

Methods: Data were obtained from tumor core biopsies evaluated with high-resolution 2DLC-MS/MS. Unlike biofluid samples, analysis of tumor tissue is expected to accurately reflect the cancer metabolism and its impact on patient survival. A comprehensive suite of machine learning algorithms was trained as base learners and then combined into a stacked-ensemble meta-learner for predicting "short" versus "long" survival on an external validation cohort. An ensemble method of feature selection was employed to find a reliable set of biomarkers with potential clinical utility.
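The stacking idea described in the Methods can be sketched in a few lines. This is a minimal illustration only, not the authors' pipeline: the base learners here are toy one-feature centroid scorers, the meta-learner is a plain logistic regression standing in for the SVM and boosted logistic regression meta-learners reported in the Results, and the data are synthetic. All function names (`centroid_learner`, `out_of_fold`, `fit_logistic`, `fit_stack`) are hypothetical.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def centroid_learner(j):
    """Toy base learner: scores feature j against the midpoint of the class means."""
    def fit(X, y):
        pos = [x[j] for x, t in zip(X, y) if t == 1]
        neg = [x[j] for x, t in zip(X, y) if t == 0]
        mu_pos, mu_neg = sum(pos) / len(pos), sum(neg) / len(neg)
        mid = (mu_pos + mu_neg) / 2.0
        sign = 1.0 if mu_pos >= mu_neg else -1.0
        return lambda x: sigmoid(sign * (x[j] - mid))
    return fit

def out_of_fold(fit, X, y, k=5):
    """Predict each sample with a model that never saw it during fitting."""
    preds = [0.0] * len(X)
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in range(fold, len(X), k):
            preds[i] = model(X[i])
    return preds

def fit_logistic(Z, y, lr=0.5, epochs=500):
    """Meta-learner stand-in: logistic regression fitted by batch gradient descent."""
    w, b, n = [0.0] * len(Z[0]), 0.0, len(Z)
    for _ in range(epochs):
        errs = [sigmoid(b + sum(wi * zi for wi, zi in zip(w, z))) - t
                for z, t in zip(Z, y)]
        w = [wi - lr * sum(e * z[j] for e, z in zip(errs, Z)) / n
             for j, wi in enumerate(w)]
        b -= lr * sum(errs) / n
    return lambda z: sigmoid(b + sum(wi * zi for wi, zi in zip(w, z)))

def fit_stack(fits, X, y, k=5):
    """Train the meta-learner on out-of-fold base predictions, then refit bases on all data."""
    Z = [list(row) for row in zip(*(out_of_fold(f, X, y, k) for f in fits))]
    meta = fit_logistic(Z, y)
    bases = [f(X, y) for f in fits]
    return lambda x: meta([m(x) for m in bases])

# Synthetic two-feature data; label 1 stands for "short" survival (illustrative only).
random.seed(0)
y = [i % 2 for i in range(60)]
X = [[random.gauss(1.0 if t else -1.0, 1.0), random.gauss(0.8 if t else -0.8, 1.0)]
     for t in y]
stack = fit_stack([centroid_learner(0), centroid_learner(1)], X, y)
print(round(stack([3.0, 3.0]), 3), round(stack([-3.0, -3.0]), 3))
```

Training the meta-learner on out-of-fold predictions, rather than on predictions for samples the base learners were fitted on, is the usual way to keep the stacked model from simply rewarding base learners that overfit.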

Results: Overall survival (OS) was predicted in the external validation cohort with an AUROC_TEST of 0.881 using a support vector machine meta-learner model, while progression-free survival (PFS) was predicted with an AUROC_TEST of 0.833 using a boosted logistic regression meta-learner model, outperforming a nomogram using covariate data (staging, age, sex, treatment vs. non-treatment) as predictors. Increased relative abundance of guanine, choline, and creatine corresponded with shorter OS, while increased leucine and tryptophan corresponded with shorter PFS. In patients who expired, N6,N6,N6-Trimethyl-L-lysine, L-pyroglutamic acid, and benzoic acid were increased, while cystine, methionine sulfoxide, and histamine were decreased. In patients with progression, itaconic acid, pyruvate, and malonic acid were increased.
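For readers unfamiliar with the metric reported above, the AUROC of a binary classifier equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. A minimal sketch of that pairwise definition follows; the function name `auroc` and the four-sample input are illustrative assumptions and are not data from the study.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    # Each positive/negative pair counts 1 if ranked correctly, 0.5 if tied, 0 otherwise.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical example: label 1 = "short" survival, scores = predicted probability of "short".
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 3 of 4 pairs ranked correctly -> 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two survival groups.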

Conclusion: This study demonstrates the feasibility of an ensemble machine learning approach to accurately predict patient survival from tumor core biopsy metabolomic data.

Keywords: Artificial intelligence; Lung cancer; Machine learning; Metabolomics; Personalized medicine; Survival prediction.

Conflict of interest statement

Disclosure of potential conflicts of interest: The authors declare that they have no competing interests.

Figures

Figure 1.
Diagram of machine learning workflow. Base learners are trained on the internal validation set using 5-fold cross validation with 10 resampling iterations on each feature subset. Features are selected by base learner variable importance. After all base learners are trained and evaluated, a stacked ensemble model is evaluated after filtering out base learners that did not achieve an AUROC_TRAIN of 0.7 or greater across all feature subsets in the internal validation data. The ensemble model is then evaluated on all feature subsets using an ensemble method of feature selection (Equation 1). The classification model performance of all base learners and meta-learners is evaluated across the feature subsets on the external validation data. EVTREE = tree models from genetic algorithms. RF = random forest. NNET = neural network (single layer). MLP = multi-layer perceptron. NSC = nearest shrunken centroids. NB = naïve Bayes. BGLM = boosted general linear model. KNN = k-nearest neighbors. SVM = support vector machine. SPLS = sparse partial least squares. BLR = boosted logistic regression. RLR = regularized logistic regression. NNFE = neural network with feature extraction. WKNN = weighted k-nearest neighbors. MANN = model averaged neural network. RRF = regularized random forest. BGAM = boosted generalized additive model. ORFSVM = oblique random forest with SVM as splitting model. SVMPoly = support vector machine with polynomial kernel.
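Two steps of this workflow lend themselves to a short sketch: generating the repeated 5-fold splits (10 repeats) and filtering base learners by their training AUROC. The code below is an assumption-laden illustration, not the authors' implementation. The splits are unstratified, the threshold is applied to the maximum AUROC across feature subsets (the reading suggested by the Figure 2 caption), and the function names and the AUROC values in `train_auroc` are hypothetical placeholders rather than results from the study.

```python
import random

def repeated_kfold(n_samples, k=5, repeats=10, seed=0):
    """Yield (train_indices, held_out_indices) for k-fold CV repeated with fresh shuffles."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]  # k roughly equal, disjoint folds
        for held_out in folds:
            held = set(held_out)
            train = [i for i in idx if i not in held]
            yield train, held_out

def select_base_learners(train_auroc, threshold=0.7):
    """Keep learners whose best AUROC over all feature subsets reaches the threshold."""
    return [name for name, by_subset in train_auroc.items()
            if max(by_subset.values()) >= threshold]

# Hypothetical internal-validation AUROCs keyed by learner, then by feature-subset size.
train_auroc = {
    "RF": {10: 0.74, 20: 0.69},
    "KNN": {10: 0.61, 20: 0.66},
    "SVM": {10: 0.70, 20: 0.72},
}
print(select_base_learners(train_auroc))      # learners passed on to the stacked ensemble
print(sum(1 for _ in repeated_kfold(20)))     # 5 folds x 10 repeats = 50 splits
```

With class-imbalanced survival groups, a stratified split that preserves the "short"/"long" ratio in each fold would usually be preferable to the plain shuffle used here.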
Figure 2.
Maximum AUROC obtained from feature selection after external test set validation of all base learner models and stacked ensemble meta learners for Overall Survival and Progression Free Survival. Patients were stratified into “long” and “short” survival groups for classification by the prediction models. Base learners that achieved a max AUROC_TRAIN of 0.7 or above in the internal validation data (gray bars, top row) were selected for the stacked ensemble models (black bars, middle row). ROC curves of optimal stacked ensemble meta learners with repeated internal cross-validation (gray) and external validation (black) for prediction of “long” and “short” OS and PFS are shown for each case (bottom row).
Figure 3.
Relative abundance of metabolites identified as significant for “short” vs. “long” OS and PFS by unpaired t-test assuming equal variance or Wilcoxon rank sum test, depending on normality of the data. Each box spans the 1st and 3rd quartiles. The band within each box is the median and x is the mean. Whisker ends are the maximum and minimum, with points outside being outliers. “Long” survival groups are in green and “short” survival groups are in yellow (*p ≤ 0.1, **p ≤ 0.05). Color figure online.
Figure 4.
Quantitative enrichment analysis. Enriched metabolic pathways for OS and PFS were found with MetaboAnalyst 5.0 using the KEGG pathway database (accessed June 2022). Color figure online.
