Biol Methods Protoc. 2024 Jun 20;9(1):bpae028.
doi: 10.1093/biomethods/bpae028. eCollection 2024.

Early detection and diagnosis of cancer with interpretable machine learning to uncover cancer-specific DNA methylation patterns

Izzy Newsham et al. Biol Methods Protoc. 2024.

Abstract

Cancer, a collection of more than two hundred different diseases, remains a leading cause of morbidity and mortality worldwide. Usually detected at the advanced stages of disease, metastatic cancer accounts for 90% of cancer-associated deaths. Early detection of cancer, combined with current therapies, would therefore have a significant impact on survival and treatment of various cancer types. Epigenetic changes such as DNA methylation are among the early events underlying carcinogenesis. Here, we report on an interpretable machine learning model that can classify 13 cancer types as well as non-cancer tissue samples using only DNA methylome data, with 98.2% accuracy. We use the features identified by this model to develop EMethylNET, a robust model in which an XGBoost model provides information to a deep neural network, and which can generalize to independent data sets. We also demonstrate that the methylation-associated genomic loci detected by the classifier are associated with genes, pathways and networks involved in cancer, providing insights into the epigenomic regulation of carcinogenesis.
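The handoff from the tree model to the neural network can be illustrated with a minimal sketch. It assumes, purely for illustration, that the "features identified by this model" are the CpG probes to which a trained tree ensemble assigns non-zero importance, and that only those probes are passed on as network inputs. The probe IDs, importance scores, beta values, and function names below are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Sequence


def select_probes(importance: Dict[str, float], min_importance: float = 0.0) -> List[str]:
    """Return probe IDs whose importance exceeds the threshold, most important first."""
    kept = [probe for probe, score in importance.items() if score > min_importance]
    return sorted(kept, key=lambda probe: importance[probe], reverse=True)


def subset_samples(samples: Sequence[Dict[str, float]], probes: Sequence[str]) -> List[List[float]]:
    """Build the reduced input matrix: one row per sample, one column per kept probe."""
    return [[sample[probe] for probe in probes] for sample in samples]


# Hypothetical importances, as a trained tree ensemble might report them.
importance = {"cg_a": 0.40, "cg_b": 0.0, "cg_c": 0.15, "cg_d": 0.0}

# Hypothetical beta values (methylation fraction between 0 and 1) for two samples.
samples = [
    {"cg_a": 0.91, "cg_b": 0.10, "cg_c": 0.45, "cg_d": 0.50},
    {"cg_a": 0.12, "cg_b": 0.11, "cg_c": 0.80, "cg_d": 0.52},
]

probes = select_probes(importance)          # probes the tree model actually used
reduced = subset_samples(samples, probes)   # matrix that a downstream network would receive
```

Reducing the input to the probes the tree model relies on keeps the downstream network small and makes each input traceable to a genomic locus, which is one plausible reason for a two-stage design of this kind.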

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1.
Overview of method. DNA methylation microarray data from 13 cancer types and corresponding normal tissues were collected from TCGA and preprocessed. For binary and multiclass classification tasks, three types of models were trained: simple models (logistic regression and support vector machines), XGBoost, and EMethylNET, a model consisting of XGBoost combined with a deep neural network. The models were then evaluated on independent data sets, and the features they used for classification were analysed.
Figure 2.
Performance of the binary and multiclass XGBoost models on the TCGA test set. a and b Confusion matrices of the best (KIRC) and worst (ESCA) performing binary XGBoost models. c AUC of the ROC curves for all binary XGBoost models. d AUC of the Precision Recall (PR) curves for both cancer and normal classes of all binary XGBoost models. Note that the scales of c and d start from 0.7. e MCC scores for all binary XGBoost models. f Confusion matrix, g AUC of the ROC curves for each class, and h AUC of the Precision Recall (PR) curves for each class of the multiclass XGBoost model. Note that the scales of g and h start from 0.9.
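For readers less familiar with the MCC scores in panel e, the following is a minimal sketch of the standard Matthews correlation coefficient computed from a binary confusion matrix. The counts are hypothetical, and returning 0.0 when a row or column of the matrix is empty is a common convention assumed here, not something stated in the source.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix.

    Ranges from -1 (total disagreement) through 0 (no better than chance)
    to +1 (perfect prediction).
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumed convention: undefined cases are reported as 0.0.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical imbalanced test set: 90 cancer samples, 10 normal samples.
# The classifier labels almost everything as cancer.
accuracy = (90 + 1) / 100                   # looks high
imbalanced_mcc = mcc(tp=90, tn=1, fp=9, fn=0)  # much lower than the accuracy suggests
```

Because MCC uses all four cells of the confusion matrix, it penalizes a classifier that performs well only on the majority class, which is relevant when cancer samples greatly outnumber normal ones.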
Figure 3.
Performance of the binary XGBoost models on independent data sets. a and b Confusion matrices of the best (BRCA) and worst (COAD) performing binary XGBoost models (according to the ROC AUC scores) on the independent data sets. c Detailed confusion matrix for COAD showing the predictions of Normal (N), Adenoma (A), and Cancer (C) samples. d AUC of the ROC curves for binary XGBoost models where the independent data set included normal samples. e AUC of the Precision Recall (PR) curves for both cancer and normal (where available) classes of binary XGBoost models on the independent data sets. f MCC scores for binary XGBoost models where the independent data set included normal samples. For d, e and f, ESCA is the average of the two ESCA independent data sets.
Figure 4.
Architecture of the feed-forward neural network (a) and its performance on all independent data sets. b Confusion matrix, c AUC of the ROC curves for each class, and d AUC of the Precision Recall (PR) curves for each class. The two ESCA data sets are combined into one ESCA class. Orange denotes normal and purple denotes cancer. Note that independent data sets were not available for every cancer type (those used lacked BLCA, KIRP, LUAD, LUSC and UCEC samples). Nevertheless, all 14 classes are retained in the rows of the confusion matrix in b to keep it square and easier to read.
Figure 5.
Cancer processes, genes, and pathways in the multiclass gene list. a A REVIGO visualization showing the significant Gene Ontology terms, restricted to the biological process domain. Only a small selection of terms is labelled. b The 20 multiclass genes found most often in abstracts about cancer. Colour indicates the number of abstracts also specifying a tissue. c A visualization of the significant KEGG pathways, where the size of the node (pathway) represents the amount of overlap between the multiclass gene list and the pathway, and the width of the edge represents the amount of overlap between the two pathways. d The Pathways in cancer KEGG pathway, showing only multiclass genes. Each multiclass gene is coloured by the difference in average methylation between cancer and normal for two cancer types: BLCA and PRAD.
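The colouring in panel d rests on a simple quantity, sketched below under the assumption that it is the mean beta value across cancer samples minus the mean across normal samples for the probes mapped to a gene. The beta values and the function name are hypothetical; the source does not specify how probes were aggregated per gene.

```python
from statistics import mean
from typing import Sequence


def methylation_difference(cancer_betas: Sequence[float], normal_betas: Sequence[float]) -> float:
    """Mean cancer beta minus mean normal beta.

    Positive values indicate higher methylation in cancer,
    negative values indicate lower methylation in cancer.
    """
    return mean(cancer_betas) - mean(normal_betas)


# Hypothetical beta values for one gene in one cancer type.
cancer = [0.8, 0.7, 0.9]
normal = [0.2, 0.3, 0.1]
difference = methylation_difference(cancer, normal)  # positive: more methylated in cancer
```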
Figure 6.
A network of cancer pathways and the multiclass genes. Each circle of nodes is a cancer pathway, and each node represents a multiclass gene. The node colour represents the number of times each multiclass gene is displayed (as they can be in multiple pathways), the edge thickness represents the number of interactions between pathways, and a black outline indicates that the multiclass gene is found in the Cancer Gene Census. The colour of the pathway name represents the pathway category.
Figure 7.
Analysis of the lncRNAs found in the gene lists. a The fractions of different gene types in all cancer gene lists, including the multiclass gene list. b A heatmap of BRCA data showing the average beta value of the multiclass lncRNAs with literature evidence, and the cancer hallmarks they are associated with. The row annotation indicates the log fold change from differential expression analysis, where non-significant fold change (adjusted P-value > 0.05) is in grey. c The top 10 multiclass lncRNAs that had the most literature evidence. d The significance levels resulting from testing the multiclass lncRNAs for previously observed cancer lncRNA features [41]. The dashed red line indicates the P-value = .05 level of significance. e Boxplot of the natural log (log_e) of gene length for non-multiclass lncRNAs and multiclass lncRNAs. ‘***’ indicates P-value < .001.
Figure 8.
Survival analysis using the gene lists from the binary models. a The two most significant Kaplan-Meier curves that differentiate survival: HNSC (P-value: 3.15 × 10⁻¹⁶) and KIRC (P-value: 3.06 × 10⁻¹⁵). b The distribution of ROC AUCs when predicting 5-year survival for cancer types with sufficient survival data. Colour represents the three different variations of input variables to the survival models. c The best ROC curves for predicting 5-year survival of the cancer types with the highest average ROC AUC: KIRC and COAD.
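As background for the Kaplan-Meier curves in Figure 8a, the following is a minimal sketch of the standard Kaplan-Meier product-limit estimator. It covers only the survival curve itself, not the test that produced the reported P-values. The follow-up times and event flags are hypothetical.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(times: Sequence[float], events: Sequence[bool]) -> List[Tuple[float, float]]:
    """Return (time, estimated survival probability) at each distinct event time.

    times:  follow-up time for each patient
    events: True if the patient died at that time, False if censored
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")

    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve: List[Tuple[float, float]] = []

    i = 0
    while i < len(data):
        current_time = data[i][0]
        deaths = 0
        leaving = 0
        # Gather everyone whose follow-up ends at this time point.
        while i < len(data) and data[i][0] == current_time:
            deaths += 1 if data[i][1] else 0
            leaving += 1
            i += 1
        if deaths > 0:
            # Patients censored at this time still count as at risk here.
            survival *= 1 - deaths / at_risk
            curve.append((current_time, survival))
        at_risk -= leaving
    return curve


# Hypothetical cohort of five patients (times in years).
curve = kaplan_meier(times=[1, 2, 2, 3, 4], events=[True, True, False, True, False])
```

Comparing two such curves, for example for patients split by a methylation-derived score, is what panels of this kind typically show; how the groups in Figure 8a were defined is not stated in the caption.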

References

    1. IARC. "Globocan: All Cancers Fact Sheet." https://gco.iarc.who.int/media/globocan/factsheets/cancers/39-all-cancer... (accessed 24.08.23, 2023).
    2. Baylin SB, Jones PA. A decade of exploring the cancer epigenome—biological and translational implications. Nat Rev Cancer 2011;11:726–34. 10.1038/nrc3130.
    3. Gonzalez-Zulueta M, Bender CM, Yang AS et al. Methylation of the 5' CpG island of the p16/CDKN2 tumor suppressor gene in normal and transformed human tissues correlates with gene silencing. Cancer Res 1995;55:4531–5. [Online]. Available: https://www.ncbi.nlm.nih.gov/pubmed/7553622.
    4. Greger V, Debus N, Lohmann D et al. Frequency and parental origin of hypermethylated RB1 alleles in retinoblastoma. Hum Genet 1994;94:491–6. 10.1007/BF00211013.
    5. Herman JG, Latif F, Weng Y et al. Silencing of the VHL tumor-suppressor gene by DNA methylation in renal carcinoma. Proc Natl Acad Sci U S A 1994;91:9700–4. 10.1073/pnas.91.21.9700.