Cancer Inform. 2016 Sep 20;15:189-98.
doi: 10.4137/CIN.S39859. eCollection 2016.

Identifying Significant Features in Cancer Methylation Data Using Gene Pathway Segmentation

Zena M Hira et al. Cancer Inform. 2016.

Abstract

To provide the most effective therapy for cancer, it is important to be able to predict whether a patient's cancer will respond to a proposed treatment. Methylation profiling could contain information from which such predictions could be made. Currently, hypothesis testing is used to determine whether possible biomarkers for cancer progression produce statistically significant results. However, this approach requires the identification of individual genes, or sets of genes, as candidate hypotheses, and with the increasing size of modern microarrays, this task is becoming progressively harder. Exhaustive testing of small sets of genes is computationally infeasible, and so hypothesis generation depends either on the use of established biological knowledge or on heuristic methods. As an alternative, machine learning methods can be used to identify groups of genes that act together within sets of cancer data and to associate their behavior with cancer progression. These methods have the advantage of being multivariate and unbiased, but they also rapidly become computationally infeasible as the number of gene probes and datasets increases. To address this problem, we have investigated a way of utilizing prior knowledge to segment microarray datasets so that machine learning can be used to identify candidate sets of genes for hypothesis testing. A methylation dataset is divided into subsets, where each subset contains only the probes that relate to a known gene pathway. Each of these pathway subsets is used independently for classification. The classification method is AdaBoost with decision trees as weak classifiers. Since each pathway subset contains a relatively small number of gene probes, it is possible to train a classifier on it, test its accuracy quickly, and determine whether the subset carries valuable diagnostic information. Finally, genes from successful pathway subsets can be combined to create a classifier of high accuracy.

Keywords: cancer progression; machine learning; methylation profiling.
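
The segmentation and classification steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses only the Python standard library, takes one-level decision trees (stumps) as the weak classifiers, and reports training accuracy for brevity, whereas the paper evaluates each subset with cross-validation. All function names, probe IDs, gene names, pathway names, and beta values below are hypothetical.

```python
import math

def segment_by_pathway(probe_to_gene, pathway_to_genes):
    """Return {pathway: sorted probe IDs whose gene belongs to that pathway}."""
    subsets = {}
    for pathway, genes in pathway_to_genes.items():
        probes = sorted(p for p, g in probe_to_gene.items() if g in genes)
        if probes:  # skip pathways that have no probes in the dataset
            subsets[pathway] = probes
    return subsets

def stump_predict(row, feature, threshold, sign):
    """One-level decision tree: predict `sign` above the threshold, else -sign."""
    return sign if row[feature] > threshold else -sign

def best_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted error."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for sign in (1, -1):
                err = sum(wi for wi, row, t in zip(w, X, y)
                          if stump_predict(row, feature, threshold, sign) != t)
                if best is None or err < best[0]:
                    best = (err, feature, threshold, sign)
    return best

def adaboost_fit(X, y, rounds=5):
    """Discrete AdaBoost; labels must be +1 or -1."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, feature, threshold, sign = best_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) on a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, feature, threshold, sign))
        # Increase the weight of misclassified samples, then renormalise.
        w = [wi * math.exp(-alpha * t * stump_predict(row, feature, threshold, sign))
             for wi, row, t in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def adaboost_predict(model, row):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(row, f, thr, s) for alpha, f, thr, s in model)
    return 1 if score >= 0 else -1

# Toy inputs: every probe ID, gene, pathway and beta value here is made up.
probe_to_gene = {"cg01": "GENE1", "cg02": "GENE1", "cg03": "GENE2", "cg04": "GENE3"}
pathway_to_genes = {"pathway_A": {"GENE1"}, "pathway_B": {"GENE2"}, "pathway_C": {"GENE9"}}
beta = {  # probe ID -> methylation value per sample
    "cg01": [0.1, 0.2, 0.8, 0.9],
    "cg02": [0.5, 0.4, 0.5, 0.6],
    "cg03": [0.3, 0.7, 0.2, 0.8],
    "cg04": [0.9, 0.1, 0.9, 0.1],
}
labels = [-1, -1, 1, 1]  # e.g. -1 = no progression, +1 = progression

subsets = segment_by_pathway(probe_to_gene, pathway_to_genes)
accuracy = {}
for pathway, probes in subsets.items():
    # Each pathway subset is classified independently of the others.
    X = [[beta[p][i] for p in probes] for i in range(len(labels))]
    model = adaboost_fit(X, labels)
    hits = sum(adaboost_predict(model, row) == t for row, t in zip(X, labels))
    accuracy[pathway] = hits / len(labels)
print(accuracy)
```

Because each subset holds only a handful of probes, the exhaustive stump search stays cheap, which is the practical point of segmenting by pathway. A real analysis would more likely rely on an established library implementation of AdaBoost and report cross-validated rather than training accuracy.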

Figures

Figure 1
Pathway algorithm: in the first step, the original methylation dataset is split into several smaller subsets such that all the genes of one subset belong to one pathway in the ConsensusPath database. AdaBoost is applied to the subsets to build classifiers for disease progression. The classification accuracy of each subset is calculated using stratified cross-validation to account for unbalanced classes. Randomly picked subsets of the probes in the original dataset are created so that the pathway sets with the highest accuracies can be tested for significance using z-scores and P values. Notes: 1, http://globocan.iarc.fr/Default.aspx; 6, https://www.etriks.org/.
Figure 2
ROC curve for the prediction of disease progression using the complete LGG dataset.
Figure 3
ROC curve for the prediction of disease progression using the complete CML dataset.
Figure 4
ROC curves for the four pathway sets with the highest accuracy on the LGG dataset.
Figure 5
Comparison between pantothenate and CoA biosynthesis and retinoate biosynthesis II pathway sets.
Figure 6
Comparison between pantothenate and CoA biosynthesis and activation of Rac pathway sets.
Figure 7
The gene selection algorithm based on accuracy thresholds and how important each feature is when constructing the decision tree.
Figure 8
ROC curve for the regulation of KIT signaling pathway set.
Figure 9
Comparison between regulation of KIT signaling and arrestins in gpcr desensitization pathway sets.
Figure 10
Comparison between regulation of KIT signaling and NF-kappa B signaling – Homo sapiens pathway sets.
Figure 11
Comparison between regulation of KIT signaling and acetylcholine synthesis pathway sets.
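
The evaluation steps named in the Figure 1 caption (stratified cross-validation and a significance test against randomly picked probe subsets) might look roughly like the sketch below. The caption does not give the exact procedure, so several details are assumptions: random subsets are drawn at the same size as the pathway subset, the z-score is computed against the mean and standard deviation of the random-subset accuracies, and the P value is a one-sided normal tail. The names `stratified_folds`, `pathway_significance`, and `toy_score` are hypothetical, and `toy_score` is only a stand-in for a real cross-validated accuracy.

```python
import math
import random
import statistics

def stratified_folds(labels, k):
    """Split sample indices into k folds, dealing each class out round-robin
    so every fold keeps roughly the original class proportions."""
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    folds = [[] for _ in range(k)]
    slot = 0
    for indices in by_class.values():
        for i in indices:
            folds[slot % k].append(i)
            slot += 1
    return folds

def pathway_significance(pathway_probes, all_probes, score_fn, n_random=200, seed=0):
    """Compare a pathway subset's score with same-sized random probe subsets.

    score_fn(probes) -> accuracy is supplied by the caller (assumed to be a
    cross-validated accuracy in a real analysis).
    Returns (z, p) where p is a one-sided normal-tail P value.
    """
    rng = random.Random(seed)
    observed = score_fn(pathway_probes)
    null = [score_fn(rng.sample(all_probes, len(pathway_probes)))
            for _ in range(n_random)]
    mu = statistics.mean(null)
    sigma = statistics.stdev(null)
    if sigma == 0:
        raise ValueError("random subsets all scored identically; z is undefined")
    z = (observed - mu) / sigma
    p = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z) for a standard normal
    return z, p

# Toy usage: 50 made-up probes, 5 of which are treated as informative.
all_probes = ["cg%02d" % i for i in range(50)]
informative = set(all_probes[:5])

def toy_score(probes):
    """Stand-in accuracy: rises with the share of informative probes."""
    return 0.5 + 0.5 * sum(p in informative for p in probes) / len(probes)

folds = stratified_folds([0, 0, 0, 0, 1, 1], k=2)
z, p = pathway_significance(all_probes[:5], all_probes, toy_score)
print(folds, round(z, 2), p)
```

Matching the size of the random subsets to the pathway subset keeps the comparison fair, since larger probe sets tend to score higher regardless of biological relevance.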
