J Proteome Res. 2018 Jan 5;17(1):337-347.
doi: 10.1021/acs.jproteome.7b00595. Epub 2017 Nov 27.

Deep Learning Accurately Predicts Estrogen Receptor Status in Breast Cancer Metabolomics Data

Fadhl M Alakwaa et al. J Proteome Res.

Abstract

Metabolomics holds promise as a new technology for diagnosing highly heterogeneous diseases. Conventionally, metabolomics data analysis for diagnosis relies on various statistical and machine learning based classification methods. However, it remains unknown whether deep neural networks, a class of increasingly popular machine learning methods, are suitable for classifying metabolomics data. Here we use a cohort of 271 breast cancer tissues, 204 estrogen receptor positive (ER+) and 67 estrogen receptor negative (ER-), to test the accuracy of feed-forward networks, a deep learning (DL) framework, against six widely used machine learning models: random forest (RF), support vector machines (SVM), recursive partitioning and regression trees (RPART), linear discriminant analysis (LDA), prediction analysis for microarrays (PAM), and generalized boosted models (GBM). The DL framework achieves the highest area under the curve (AUC), 0.93, in classifying ER+/ER- patients, compared to the other six machine learning algorithms. Furthermore, biological interpretation of the first hidden layer reveals eight commonly enriched significant metabolomics pathways (adjusted P-value < 0.05) that cannot be discovered by other machine learning methods. Among them, the protein digestion and absorption pathway and the ATP-binding cassette (ABC) transporters pathway are also confirmed in an integrated analysis of metabolomics and gene expression data from these samples. In summary, the deep learning method shows advantages for metabolomics-based breast cancer ER status classification, with both the highest prediction accuracy (AUC = 0.93) and better revelation of disease biology. We encourage the metabolomics research community to adopt feed-forward network based deep learning for classification.

Keywords: bioinformatics; breast cancer; deep learning; estrogen receptor; metabolomics.

Conflict of interest statement

The authors declare no competing financial interest.

Figures

Figure 1
Block diagram of the proposed system. The first step is preprocessing (log transformation, centering, autoscaling, and quantile normalization). We used autoencoder pretraining (an unsupervised step) to initialize model weights and select the model architecture. The model was trained on 80% of the data, and the remaining 20% was used to measure model performance. The data were split 10 times to avoid sampling bias, and the average AUC was calculated over the 10 hold-out test sets.
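The evaluation protocol in this caption (10 random 80/20 splits, AUC averaged over the hold-out sets) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the nearest-centroid scorer stands in for the deep learning model, the data are synthetic, and stratifying each split by class is an assumption that the caption does not state. All function names are hypothetical.

```python
import random
from statistics import mean


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def centroid_scorer(train_X, train_y):
    """Stand-in classifier (NOT the paper's DL model): higher score = closer to the positive centroid."""
    def centroid(rows):
        return [mean(col) for col in zip(*rows)]

    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    c_pos = centroid([x for x, y in zip(train_X, train_y) if y == 1])
    c_neg = centroid([x for x, y in zip(train_X, train_y) if y == 0])
    return lambda x: dist2(x, c_neg) - dist2(x, c_pos)


def stratified_split(y, train_frac, rng):
    """Split indices per class so both classes appear in train and test (an assumption)."""
    train, test = [], []
    for cls in (0, 1):
        idx = [i for i, label in enumerate(y) if label == cls]
        rng.shuffle(idx)
        cut = int(train_frac * len(idx))
        train += idx[:cut]
        test += idx[cut:]
    return train, test


def repeated_holdout_auc(X, y, n_splits=10, train_frac=0.8, seed=0):
    """Train on 80%, score the held-out 20%, repeat n_splits times, and average the AUCs."""
    rng = random.Random(seed)
    aucs = []
    for _ in range(n_splits):
        train, test = stratified_split(y, train_frac, rng)
        score = centroid_scorer([X[i] for i in train], [y[i] for i in train])
        aucs.append(auc([y[i] for i in test], [score(X[i]) for i in test]))
    return mean(aucs), aucs


# Synthetic stand-in data: 204 "ER+" (label 1) and 67 "ER-" (label 0) samples, 5 features each.
data_rng = random.Random(0)
y = [1] * 204 + [0] * 67
X = [[data_rng.gauss(1.0 if label else 0.0, 1.0) for _ in range(5)] for label in y]

mean_auc, aucs = repeated_holdout_auc(X, y)
print(f"mean AUC over {len(aucs)} hold-out sets: {mean_auc:.3f}")
```

Averaging over several random splits reduces the dependence of the reported AUC on any single partition of a small cohort, which is the sampling bias the caption refers to.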
Figure 2
(A) Average AUC on 10 hold-out test sets of the DL framework against six machine learning algorithms for prediction of ER status from metabolomics data: recursive partitioning and regression trees (RPART) (0.83), linear discriminant analysis (LDA) (0.74), support vector machine (SVM) (0.89), deep learning (DL) (0.93), random forest (RF) (0.89), generalized boosted models (GBM) (0.89), and prediction analysis for microarrays (PAM) (0.88). All algorithms were run 10 times on different train/test splits. We used a pairwise Wilcoxon signed-rank test to estimate the statistical significance of the difference in performance between DL and the other methods (∗∗ p < 0.01, ∗ p < 0.1). (B) Bipartite graph of the top 20 important metabolites extracted from the DL model and the other machine learning algorithms. Large nodes represent the models and small nodes represent metabolites. A connection between a metabolite and a model means that the metabolite is among the top 20 most important metabolites extracted by that model.
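The paired comparison described in panel A can be illustrated with a small exact Wilcoxon signed-rank test. This is a sketch under stated assumptions: zero differences are dropped, tied absolute differences receive average ranks, and the two-sided p-value is computed by enumerating all sign assignments (feasible for 10 paired splits). The per-split AUC values are made up for illustration and are not taken from the paper; the function name is hypothetical.

```python
from itertools import product


def wilcoxon_signed_rank(x, y):
    """Return (W+, exact two-sided p-value) for paired samples x and y."""
    # Round to suppress floating-point noise, then drop zero differences.
    diffs = [round(a - b, 9) for a, b in zip(x, y)]
    diffs = [d for d in diffs if d != 0]
    n = len(diffs)

    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1

    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = min(w_plus, total - w_plus)

    # Exact null distribution: every difference is equally likely to be + or -.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= observed:
            extreme += 1
    return w_plus, extreme / 2 ** n


# Hypothetical per-split AUCs for two methods over the same 10 train/test splits.
method_a = [0.93, 0.95, 0.91, 0.94, 0.92, 0.96, 0.90, 0.93, 0.94, 0.92]
method_b = [0.89, 0.90, 0.90, 0.91, 0.88, 0.90, 0.91, 0.89, 0.92, 0.90]

w_plus, p_value = wilcoxon_signed_rank(method_a, method_b)
print(f"W+ = {w_plus}, exact two-sided p = {p_value:.4f}")
```

A paired test fits this design because every method is evaluated on the same 10 splits, so the per-split differences, rather than the raw AUCs, carry the comparison.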
Figure 3
Biological relevance of the DL hidden layers. (A) Activation levels of the high-variance nodes extracted from layer 1 of the DL model. Columns are samples and rows are the top 12 nodes with variance > 0.1. (B) Bipartite graph of enriched significant metabolomics pathways and top hidden nodes. The nodes represent enriched pathways common to all top 12 nodes (green color) in the first hidden layer of the DL model in KEGG pathway enrichment analysis (FDR < 0.05).
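One plausible reading of the node selection in panel A is: compute the variance of each first-layer node's activation across samples, keep nodes whose variance exceeds 0.1, and take the 12 with the largest variance. The sketch below assumes that reading and uses population variance; the caption does not specify the variance estimator, and the activation values are toy numbers. The function name is hypothetical.

```python
from statistics import pvariance


def high_variance_nodes(activations, threshold=0.1, top_k=12):
    """activations: one row per sample, one column per hidden node.

    Returns the indices of up to top_k nodes whose activation variance across
    samples exceeds threshold, ordered from highest to lowest variance.
    """
    variances = [pvariance(column) for column in zip(*activations)]
    kept = [(v, j) for j, v in enumerate(variances) if v > threshold]
    kept.sort(reverse=True)
    return [j for _, j in kept[:top_k]]


# Toy activations: 4 samples x 3 hidden nodes (illustrative values only).
toy_activations = [
    [0.0, 0.5, 0.0],
    [1.0, 0.5, 2.0],
    [0.0, 0.5, 0.0],
    [1.0, 0.5, 2.0],
]
selected = high_variance_nodes(toy_activations)
print(selected)  # node 1 is constant across samples, so it is filtered out
```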
Figure 4
Joint pathway analysis between the top 20 DL metabolites and the highly differentiated enzymes. Only significant pathways with at least five overlapping metabolites are shown. The x-axis shows the number of overlapping metabolites, with the number of genes involved in the same pathway in parentheses; the y-axis shows the adjusted joint P-value calculated by the IMPALA tool. The size of each node represents the size of the metabolomic pathway (the number of metabolites involved in that pathway). The color of each node represents the database source of the pathway.
Figure 5
Circos plot of Spearman’s correlation values between top 20 DL metabolites and highly differentiated enzymes with cutoff = |0.35|.
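The filtering behind this plot can be sketched as computing Spearman's rank correlation for every metabolite and enzyme pair and keeping pairs whose absolute correlation reaches the cutoff. This is a minimal illustration: treating the cutoff as inclusive (|rho| >= 0.35) is an assumption, the names `metabolite_A`, `enzyme_X`, and `enzyme_Y` are placeholders, and the values are toy data rather than measurements from the study.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks (inputs must not be constant)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def correlated_pairs(metabolites, enzymes, cutoff=0.35):
    """Keep (metabolite, enzyme, rho) triples with |rho| >= cutoff."""
    pairs = []
    for m_name, m_values in metabolites.items():
        for e_name, e_values in enzymes.items():
            rho = spearman(m_values, e_values)
            if abs(rho) >= cutoff:
                pairs.append((m_name, e_name, rho))
    return pairs


# Toy levels across six samples (placeholder names and values).
metabolites = {"metabolite_A": [1, 2, 3, 4, 5, 6]}
enzymes = {
    "enzyme_X": [6, 5, 4, 3, 2, 1],  # perfectly anti-correlated with metabolite_A
    "enzyme_Y": [2, 6, 1, 5, 3, 4],  # weakly correlated, falls below the cutoff
}
pairs = correlated_pairs(metabolites, enzymes)
print(pairs)
```

Using the absolute value keeps both strongly positive and strongly negative associations, which matches the |0.35| notation in the caption.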
Figure 6
Beta-alanine and ABAT interaction network. (A) Metabolite level of beta-alanine and expression of ABAT. (B) Beta-alanine-ABAT interaction network in ER– breast cancer tissues compared to ER+ breast cancer tissues. MetScape, a Cytoscape plug-in, was used to integrate ER+/ER– metabolomics and gene expression data (GSE59198) from the same patients. The fold change of each metabolite (hexagon nodes) or enzyme (circle nodes) is represented by the size of the node. The inputs to MetScape are the top 20 metabolites from the DL model and the 898 genes whose expression values differ significantly between ER– and ER+ samples. Enzymes and metabolites with a significant difference are marked by green line(s) on the shapes.

