Quant Biol. 2017 Dec;5(4):302-327.
doi: 10.1007/s40484-017-0119-0. Epub 2017 Nov 23.

Towards integrated oncogenic marker recognition through mutual information-based statistically significant feature extraction: an association rule mining based study on cancer expression and methylation profiles

Saurav Mallik et al. Quant Biol. 2017 Dec.

Abstract

Background: Marker detection is an important task in complex disease studies. Here we provide an association rule mining (ARM) based approach for identifying integrated markers through mutual information (MI) based statistically significant feature extraction, and apply it to acute myeloid leukemia (AML) and prostate carcinoma (PC) gene expression and methylation profiles.

Methods: We first collect the genes that have both expression and methylation values in AML as well as in PC. Next, we run the Jarque-Bera normality test on the expression/methylation data to divide the whole dataset into two parts: one that follows a normal distribution and one that does not. This yields four parts of the dataset: normally distributed expression data, normally distributed methylation data, non-normally distributed expression data, and non-normally distributed methylation data. A feature-extraction technique, "mRMR", is then applied to each part, producing a list of top-ranked genes. Next, we apply the Welch t-test (parametric test) and the Shrink t-test (non-parametric test) to the expression/methylation data of the top selected normally distributed genes and non-normally distributed genes, respectively. We then use a recent weighted ARM method, "RANWAR", to combine all/specific resultant genes and generate the top oncogenic rules along with their respective integrated markers. Finally, we perform a literature search as well as KEGG pathway and Gene Ontology (GO) analyses using the Enrichr database for in silico validation of the prioritized oncogenes as markers, and to label the markers as existing or novel.
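The normality split and the parametric test described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the Jarque-Bera test is run per gene across samples at a significance level of 0.05, and the helper names (`split_by_normality`, `welch_p_value`) and the toy values are hypothetical. The mRMR, Shrink t-test, and RANWAR steps are not shown.

```python
import numpy as np
from scipy import stats


def split_by_normality(data, alpha=0.05):
    """Partition genes by the Jarque-Bera normality test.

    data: dict mapping gene name -> 1-D sequence of values across samples.
    Returns (normal_genes, non_normal_genes).
    Assumption: the test is applied per gene; alpha is a hypothetical cutoff.
    """
    normal, non_normal = [], []
    for gene, values in data.items():
        _, p_value = stats.jarque_bera(np.asarray(values, dtype=float))
        # Failing to reject normality (p >= alpha) puts the gene in the
        # "normally distributed" part; otherwise it goes to the other part.
        (normal if p_value >= alpha else non_normal).append(gene)
    return normal, non_normal


def welch_p_value(diseased, control):
    """Two-sided Welch t-test p-value (unequal variances assumed)."""
    return stats.ttest_ind(diseased, control, equal_var=False).pvalue


# Toy values, invented for illustration only (not from the study).
n = 100
bell = stats.norm.ppf((np.arange(n) + 0.5) / n)  # symmetric, bell-shaped
spike = np.array([0.0] * 99 + [100.0])           # one extreme outlier
toy = {"geneA": bell, "geneB": spike}

normal_genes, non_normal_genes = split_by_normality(toy)
p = welch_p_value([10.0, 11.0, 12.0, 13.0, 14.0], [0.0, 1.0, 2.0, 3.0, 4.0])
print(normal_genes, non_normal_genes, p < 0.05)
```

In a full pipeline, the same split would be run separately on the expression and the methylation matrices, giving the four sub-datasets named in the Methods.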

Results: The novel markers of AML are {ABCB11↑∪KRT17↓} (i.e., ABCB11 as up-regulated, and KRT17 as down-regulated), and {AP1S1−∪KRT17↓∪NEIL2−∪DYDC1↓} (i.e., AP1S1 and NEIL2 both as hypo-methylated, and KRT17 and DYDC1 both as down-regulated). The novel marker of PC is {UBIAD1¶∪APBA2‡∪C4orf31‡} (i.e., UBIAD1 as up-regulated and hypo-methylated, and APBA2 and C4orf31 both as down-regulated and hyper-methylated).
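To make the marker notation concrete, the sketch below computes plain support and confidence for a rule whose antecedent is a set of discretized gene states and whose consequent is the class label. It is only an illustration under stated assumptions: RANWAR is a weighted rule-mining method and is not reproduced here, the function name `rule_stats` is hypothetical, and the samples and item labels ("G1_up", "G2_down") are invented placeholders rather than study data. A marker such as {ABCB11↑∪KRT17↓} would correspond to an antecedent of this form.

```python
def rule_stats(samples, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent.

    samples: list of sets of items, one set per discretized sample.
    Unweighted support/confidence only; not the weighted RANWAR scores.
    """
    antecedent = set(antecedent)
    both = antecedent | {consequent}
    n_ante = sum(1 for s in samples if antecedent <= s)
    n_both = sum(1 for s in samples if both <= s)
    support = n_both / len(samples)
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence


# Invented placeholder samples (not from the study).
samples = [
    {"G1_up", "G2_down", "disease"},
    {"G1_up", "G2_down", "disease"},
    {"G1_up", "G2_down", "normal"},
    {"G1_up", "normal"},
]
support, confidence = rule_stats(samples, {"G1_up", "G2_down"}, "disease")
print(support, confidence)
```

Here the rule holds in two of four samples, and in two of the three samples that match the antecedent.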

Conclusion: The identified novel markers might have critical roles in AML as well as PC. The approach can be applied to other complex diseases.

Keywords: feature extraction; integrated markers; rule mining; statistical test.

Figures

Figure 1. Example of the step post-discretization in our proposed framework
“↑” and “↓” denote up-regulation and down-regulation, respectively, whereas “+” and “−” refer to hyper-methylation and hypo-methylation, respectively. Here, sds and snor signify diseased and normal samples, respectively. “nE” and “nM” denote normalized expression and methylation profiles, respectively.
Figure 2. Flowchart of our proposed framework of identifying oncogenic rules through integrated study for multi-view dataset consisting of expression and methylation datasets.
Figure 3. Maximum score of feature selection (i.e., Fs_maxscore) at each iteration (up to 100 iterations) for NNE sub-dataset of AML dataset
Notably, horizontal axis denotes Fs_maxscore, whereas vertical axis represents iteration ID of feature selection.
Figure 4. Maximum score of feature selection (i.e., Fs_maxscore) at each iteration (up to 100 iterations) for NNM sub-dataset of AML dataset
Notably, horizontal axis denotes Fs_maxscore, whereas vertical axis represents iteration ID of feature selection.
Figure 5. Maximum score of feature selection (i.e., Fs_maxscore) at each iteration (up to 100 iterations) for NNNE sub-dataset of AML dataset
Notably, horizontal axis denotes Fs_maxscore, whereas vertical axis represents iteration ID of feature selection.
Figure 6. Maximum score of feature selection (i.e., Fs_maxscore) at each iteration (up to 100 iterations) for NNNM sub-dataset of AML dataset
Notably, horizontal axis denotes Fs_maxscore, whereas vertical axis represents iteration ID of feature selection.
Figure 7. Heatmap for the resultant oncogenes for the AML expression dataset
Notably, horizontal axis denotes oncogenes, whereas vertical axis depicts samples.
Figure 8. Heatmap for the resultant oncogenes for the AML methylation dataset
Notably, horizontal axis denotes oncogenes, whereas vertical axis depicts samples.
