Identifying Significant Features in Cancer Methylation Data Using Gene Pathway Segmentation

Zena M Hira¹, Duncan F Gillies¹

PMID: 27688706
PMCID: PMC5030825
DOI: 10.4137/CIN.S39859

Cancer Inform. 2016 Sep 20;15:189-98. doi: 10.4137/CIN.S39859. eCollection 2016.

Affiliation

¹ Department of Computing, Imperial College London, London, UK.

Abstract

To provide the most effective therapy for cancer, it is important to be able to determine whether a patient's cancer will respond to a proposed treatment. Methylation profiling could contain information from which such predictions could be made. Currently, hypothesis testing is used to determine whether possible biomarkers for cancer progression produce statistically significant results. However, this approach requires the identification of individual genes, or sets of genes, as candidate hypotheses, and with the increasing size of modern microarrays, this task is becoming progressively harder. Exhaustive testing of small sets of genes is computationally infeasible, and so hypothesis generation depends either on the use of established biological knowledge or on heuristic methods. As an alternative, machine learning methods can be used to identify groups of genes that are acting together within sets of cancer data and to associate their behaviors with cancer progression. These methods have the advantage of being multivariate and unbiased, but unfortunately they also rapidly become computationally infeasible as the number of gene probes and datasets increases. To address this problem, we have investigated a way of utilizing prior knowledge to segment microarray datasets so that machine learning can be used to identify candidate sets of genes for hypothesis testing. A methylation dataset is divided into subsets, where each subset contains only the probes that relate to a known gene pathway. Each of these pathway subsets is used independently for classification. The classification method is AdaBoost with decision trees as weak classifiers. Since each pathway subset contains a relatively small number of gene probes, it is possible to train a classifier on it, test its classification accuracy quickly, and determine whether it has valuable diagnostic information. Finally, genes from successful pathway subsets can be combined to create a classifier of high accuracy.
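The segmentation and per-pathway classification described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation. The probe identifiers, gene names, and pathway names are hypothetical toy values. Depth-1 decision stumps stand in for the decision trees used as weak classifiers. The toy check uses training accuracy rather than the cross-validated accuracy that the paper reports.

```python
import math

def segment_by_pathway(data, probe_to_gene, pathways):
    """Return {pathway: {probe: values}}, keeping only probes whose gene is in the pathway."""
    subsets = {}
    for name, genes in pathways.items():
        subset = {p: v for p, v in data.items() if probe_to_gene.get(p) in genes}
        if subset:  # skip pathways with no probes on the array
            subsets[name] = subset
    return subsets

def stump_predictions(values, threshold, polarity):
    """Predict +polarity above the threshold and -polarity otherwise."""
    return [polarity if v > threshold else -polarity for v in values]

def best_stump(columns, labels, weights):
    """Exhaustively search for the stump with the lowest weighted error."""
    best = None
    for probe, values in columns.items():
        for threshold in sorted(set(values)):
            for polarity in (1, -1):
                preds = stump_predictions(values, threshold, polarity)
                err = sum(w for w, p, y in zip(weights, preds, labels) if p != y)
                if best is None or err < best[0]:
                    best = (err, probe, threshold, polarity)
    return best

def adaboost_fit(columns, labels, rounds=10):
    """Discrete AdaBoost; labels must be +1 or -1."""
    n = len(labels)
    weights = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, probe, threshold, polarity = best_stump(columns, labels, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, probe, threshold, polarity))
        preds = stump_predictions(columns[probe], threshold, polarity)
        # Increase the weight of misclassified samples, then renormalise.
        weights = [w * math.exp(-alpha * y * p) for w, y, p in zip(weights, labels, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model

def adaboost_predict(model, sample):
    """Weighted vote of all stumps for one sample given as {probe: value}."""
    score = sum(a * (pol if sample[probe] > thr else -pol) for a, probe, thr, pol in model)
    return 1 if score >= 0 else -1

# Toy data: three hypothetical probes measured on four samples.
data = {
    "cg01": [0.1, 0.2, 0.8, 0.9],
    "cg02": [0.5, 0.4, 0.5, 0.6],
    "cg03": [0.9, 0.1, 0.8, 0.2],
}
probe_to_gene = {"cg01": "GENE_A", "cg02": "GENE_B", "cg03": "GENE_C"}
pathways = {"pathway_1": {"GENE_A", "GENE_B"}, "pathway_2": {"GENE_C"}}
labels = [-1, -1, 1, 1]  # -1 = no progression, +1 = progression (assumed coding)

subsets = segment_by_pathway(data, probe_to_gene, pathways)
models = {name: adaboost_fit(cols, labels, rounds=3) for name, cols in subsets.items()}
```

Because each pathway subset holds only a handful of probes, the exhaustive stump search stays cheap, which is the practical point of segmenting the dataset before training.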

Keywords: cancer progression; machine learning; methylation profiling.

Figures

**Figure 1**
Pathway algorithm. In the first step, the original methylation dataset is split into several smaller subsets such that all the genes of one subset belong to one pathway in the ConsensusPath database. AdaBoost is applied to the subsets to build classifiers for disease progression. The classification accuracy of each subset is calculated using stratified cross-validation to account for unbalanced classes. Randomly picked subsets of the probes in the original dataset are created so that the pathway sets with the highest accuracies can be tested for significance using z-scores and P values. **Notes:** ¹ http://globocan.iarc.fr/Default.aspx. ⁶ https://www.etriks.org/.
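The evaluation steps named in this caption can be sketched as follows, using only the Python standard library. This is an illustration under stated assumptions, not the authors' code. The function names are hypothetical. Random subsets are assumed to be the same size as the pathway set they are compared with. The P value is assumed to be one-sided under a normal approximation. The accuracies in the usage lines are made-up toy numbers, not results from the paper.

```python
import random
from collections import defaultdict
from statistics import NormalDist, mean, stdev

def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds, spreading each class evenly across folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # deal each class round-robin
    return folds

def random_probe_subsets(all_probes, size, count, seed=0):
    """Draw `count` random probe subsets of a given size (all_probes must be a list)."""
    rng = random.Random(seed)
    return [rng.sample(all_probes, size) for _ in range(count)]

def z_score_and_p(pathway_accuracy, random_accuracies):
    """Compare a pathway's accuracy with the accuracies of random probe subsets."""
    mu = mean(random_accuracies)
    sigma = stdev(random_accuracies)  # sample standard deviation
    z = (pathway_accuracy - mu) / sigma
    p = 1.0 - NormalDist().cdf(z)  # one-sided: is the pathway better than chance subsets?
    return z, p

# Toy usage with made-up accuracies.
folds = stratified_folds([0] * 6 + [1] * 3, k=3)
z, p = z_score_and_p(0.80, [0.50, 0.55, 0.60, 0.45, 0.50])
```

Stratifying the folds keeps the minority class present in every test fold, which matters when the progression classes are unbalanced.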

**Figure 2**
ROC curve for the prediction of disease progression using the complete LGG dataset.

**Figure 3**
ROC curve for the prediction of disease progression using the complete CML dataset.
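For readers unfamiliar with how a ROC curve such as those in Figures 2 and 3 is obtained, the sketch below sweeps a decision threshold over classifier scores and computes the area under the curve with the trapezoidal rule. It is a generic illustration. The scores and labels are toy values, and the function names are hypothetical.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points; labels are 1 or 0."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        score = ranked[i][0]
        # Samples with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == score:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Toy scores (higher means more likely to progress) and true labels.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
toy_truth = [1, 1, 0, 1, 0, 0]
curve = roc_points(toy_scores, toy_truth)
area = auc(curve)
```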

**Figure 4**
ROC curves for the four pathway sets with the highest accuracy on the LGG dataset.

**Figure 5**
Comparison between *pantothenate and CoA biosynthesis* and *retinoate biosynthesis II* pathway sets.

**Figure 6**
Comparison between *pantothenate and CoA biosynthesis* and *activation of Rac* pathway sets.

**Figure 7**
The gene selection algorithm based on accuracy thresholds and how important each feature is when constructing the decision tree.
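One possible reading of this selection step is sketched below: keep only the pathway sets whose accuracy clears a threshold, then keep the probes that the trained trees found important. The caption does not give the exact rule. The thresholds, pathway names, probe names, and numbers are all hypothetical.

```python
def select_probes(pathway_accuracy, feature_importance,
                  accuracy_threshold=0.7, importance_threshold=0.1):
    """Union of important probes taken from sufficiently accurate pathway sets.

    pathway_accuracy:   {pathway: cross-validated accuracy}
    feature_importance: {pathway: {probe: importance in the trained classifier}}
    Both threshold defaults are assumed values for illustration only.
    """
    selected = set()
    for pathway, accuracy in pathway_accuracy.items():
        if accuracy < accuracy_threshold:
            continue  # the pathway set carries too little diagnostic information
        for probe, importance in feature_importance[pathway].items():
            if importance >= importance_threshold:
                selected.add(probe)
    return sorted(selected)

# Toy inputs, not taken from the paper.
toy_accuracy = {"pathway_A": 0.82, "pathway_B": 0.55, "pathway_C": 0.74}
toy_importance = {
    "pathway_A": {"cg01": 0.60, "cg02": 0.05},
    "pathway_B": {"cg03": 0.90},
    "pathway_C": {"cg01": 0.30, "cg04": 0.45},
}
combined = select_probes(toy_accuracy, toy_importance)
```

The combined probe list would then be used to train a single classifier, in line with the final step described in the abstract.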

**Figure 8**
ROC curve for the *regulation of KIT signaling* pathway set.

**Figure 9**
Comparison between *regulation of KIT signaling* and *arrestins in gpcr desensitization* pathway sets.

**Figure 10**
Comparison between *regulation of KIT signaling* and *NF-kappa B signaling – Homo sapiens* pathway sets.

**Figure 11**
Comparison between *regulation of KIT signaling* and *acetylcholine synthesis* pathway sets.
