PLoS One. 2015 Apr 1;10(4):e0119448.

doi: 10.1371/journal.pone.0119448. eCollection 2015.

Analyzing large gene expression and methylation data profiles using StatBicRM: statistical biclustering-based rule mining

Ujjwal Maulik¹, Saurav Mallik², Anirban Mukhopadhyay³, Sanghamitra Bandyopadhyay²

Affiliations

¹ Department of Computer Science and Engineering, Jadavpur University, Kolkata, West Bengal, India.
² Machine Intelligence Unit, Indian Statistical Institute, Kolkata, West Bengal, India.
³ Department of Computer Science and Engineering, University of Kalyani, Kalyani, West Bengal, India.

PMID: 25830807
PMCID: PMC4382191
DOI: 10.1371/journal.pone.0119448

Ujjwal Maulik et al. PLoS One. 2015.

Abstract

Microarrays and BeadChips are two of the most efficient technologies for measuring gene expression and methylation data in bioinformatics. Biclustering is the simultaneous clustering of genes and samples. In this article, we propose a computational rule mining framework, StatBicRM (statistical biclustering-based rule mining), that identifies a special type of rule and potential biomarkers from biological datasets by integrating statistical techniques with binary inclusion-maximal biclustering. First, a novel statistical strategy eliminates insignificant, low-significance, and redundant genes in such a way that the significance level respects the distribution of the data (normal or non-normal). The data are then discretized and post-discretized in turn. Next, the biclustering technique is applied to identify maximal frequent closed homogeneous itemsets, and the corresponding special rules are extracted from the selected itemsets. The proposed method performs better than other rule mining algorithms because it generates maximal frequent closed homogeneous itemsets instead of frequent itemsets; it therefore saves running time and can work on large datasets. Pathway and Gene Ontology analyses are conducted on the genes of the evolved rules using the DAVID database. Frequency analysis of the genes appearing in the evolved rules is performed to determine potential biomarkers. We also classify the data to assess how accurately the evolved rules describe the remaining (unseen) test data, and we compare the average classification accuracy and other related measures with those of other rule-based classifiers. Statistical significance tests are performed to verify the statistical relevance of the comparative results. Each of the other rule mining methods and rule-based classifiers also starts from the same post-discretized data matrix. Finally, we include an integrated analysis of gene expression and methylation to determine the epigenetic effect (the effect of methylation) on gene expression level.
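To make the notion of a maximal frequent closed itemset concrete, the following is a minimal brute-force sketch in Python. It is not the authors' biclustering procedure: it assumes each sample has already been discretized into a set of signed gene labels, and it enumerates closed itemsets as intersections of groups of samples, which is only practical for tiny inputs. The function name, the labels `G1+`, `G2-`, and so on, and the toy samples are all hypothetical.

```python
from itertools import combinations


def maximal_frequent_closed_itemsets(transactions, min_support):
    """Return {itemset: support} for maximal frequent closed itemsets.

    transactions: list of frozensets of items (one frozenset per sample).
    min_support:  minimum number of samples (>= 1) that must contain an itemset.
    """
    closed = {}
    n = len(transactions)
    # The intersection of any group of transactions is a closed itemset,
    # and every closed itemset arises this way. Groups of at least
    # min_support transactions guarantee the itemset is frequent.
    for k in range(min_support, n + 1):
        for group in combinations(transactions, k):
            common = frozenset.intersection(*group)
            if common:
                support = sum(1 for t in transactions if common <= t)
                closed[common] = support
    # Keep only itemsets that are not a proper subset of another one.
    return {
        items: support
        for items, support in closed.items()
        if not any(items < other for other in closed)
    }


# Hypothetical discretized samples: "+" marks up-regulation, "-" down-regulation.
samples = [
    frozenset({"G1+", "G2-", "G3+"}),
    frozenset({"G1+", "G2-", "G3+"}),
    frozenset({"G1+", "G2-"}),
    frozenset({"G4+"}),
]
result = maximal_frequent_closed_itemsets(samples, min_support=2)
for items, support in result.items():
    print(sorted(items), support)
```

In this toy input, `{G1+, G2-}` is closed and frequent but not maximal, because it is contained in the larger frequent closed itemset `{G1+, G2-, G3+}`. Reporting only the maximal sets is what keeps the output smaller than a full list of frequent itemsets.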

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Fig 1. Flowchart of the proposed methodology (*StatBicRM*) for the rule mining.**
Here, the terms *TOTALDESET* _N, *TOTALDESET* _NN, and *TOTALDESET* _N+NN are described in the last paragraph of the subsection *“Identification of differentially expressed/methylated genes using Statistical tests”*. For the methylation dataset, these terms are replaced by *TOTALDMSET* _N, *TOTALDMSET* _NN, and *TOTALDMSET* _N+NN, respectively.

**Fig 2. Flowchart of the proposed methodology (*StatBicRM*) for the classification.**
Here, the terms *TOTALDESET* _N, *TOTALDESET* _NN, and *TOTALDESET* _N+NN are described in the last paragraph of the subsection *“Identification of differentially expressed/methylated genes using Statistical tests”*. For the methylation dataset, these terms are replaced by *TOTALDMSET* _N, *TOTALDMSET* _NN, and *TOTALDMSET* _N+NN, respectively.

**Fig 3. An example of generating special rules from data matrix of the differentially expressed genes.**
Here, up-regulation (‘+’) and down-regulation (‘-’) are denoted by ‘1’ and ‘0’ in (b), and by red and green in (c), respectively. s _tr and s _nr denote experimental/diseased/treated samples and control/normal samples, respectively.

**Fig 4. An example of classification of evolved rules by the majority voting using weighted-sum.**
Here, ‘r’ and ‘w’ denote the rank and weight of the rule (computed by Equation 18), respectively. A tick mark or cross mark in the ‘Q’ column indicates whether the test point (ts) is satisfied or not satisfied by the corresponding rule.
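The voting scheme in Fig 4 can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: Equation 18 is not reproduced on this page, so the rule weights are simply supplied as inputs, and a rule is taken to be "satisfied" when all items of its antecedent appear in the test point. The function name, the gene labels, and the weights are hypothetical; the class labels AC and SCC are those used for Dataset 1.

```python
from collections import defaultdict


def classify_by_weighted_vote(test_items, rules):
    """Predict a class label by summing the weights of satisfied rules.

    test_items: set of discretized items describing the test point.
    rules:      iterable of (antecedent_set, class_label, weight) tuples.
    Returns the label with the largest weighted sum, or None if no rule fires.
    """
    totals = defaultdict(float)
    for antecedent, label, weight in rules:
        # A rule votes only if the test point satisfies its whole antecedent.
        if antecedent <= test_items:
            totals[label] += weight
    if not totals:
        return None
    # On ties, max() returns the label that first received a vote.
    return max(totals, key=totals.get)


# Hypothetical rules and weights (assumed, not taken from the paper).
rules = [
    ({"G1+", "G2-"}, "AC", 0.5),
    ({"G3-"}, "SCC", 0.4),
    ({"G1+"}, "AC", 0.2),
    ({"G4+"}, "SCC", 0.9),  # not satisfied by the test point below
]
test_point = {"G1+", "G2-", "G3-"}
prediction = classify_by_weighted_vote(test_point, rules)
print(prediction)
```

Because the unsatisfied rule contributes nothing, its large weight does not affect the outcome; only the rules marked with a tick in the ‘Q’ column take part in the sum.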

**Fig 5. The clustergram of the common differentially expressed genes (by different statistical tests) for DS1.**
Here, red colour denotes up-regulation of genes across the specific samples/conditions, and green colour denotes down-regulation of genes across the specific samples/conditions.

**Fig 6. Volcano plot for identifying differentially up- and down-regulated genes from Dataset 1 by SAM.**

**Fig 7. A graphical representation of the gene expression of a maximal homogeneous bicluster (i.e., a *MFCHOI*) over different samples.**

**Fig 8. Bar charts: (a) comparison of dataset-wise average accuracies, and (b) comparison of dataset-wise average MCCs, between the proposed and other existing rule-based classifiers for the four datasets.**

**Fig 9. Boxplots of significance tests (i.e., one-way ANOVA) for identifying the significance levels (i.e., p-values) of accuracies between the proposed and other rule-based classifiers (pairwise).**
Results are shown for Dataset 1 [in (a).(i-vi)], Dataset 2 [in (b).(i-vi)], Dataset 3 [in (c).(i-vi)] and Dataset 4 [in (d).(i-vi)], where (i) proposed vs ConjunctiveRule, (ii) proposed vs DecisionTable, (iii) proposed vs JRip, (iv) proposed vs OneR, (v) proposed vs PART and (vi) proposed vs Ridor. The vertical axis denotes the accuracy of the classifier.

**Fig 10. Two examples of how significant biomarkers are identified from the maximal homogeneous biclusters (i.e., *MFCHOI*) for each class-label for each dataset.**
Here, we show the intersection of only four maximal homogeneous biclusters for (a) the class-label AC and (b) the class-label *SCC*, individually (for Dataset 1). For the class AC, CENPA-, TTK-, KIF11-, KIF18B- and ZNF367- are the most frequent genes, as they occur in all four biclusters (see (a)); similarly, for the class *SCC*, SHROOM3- is the most frequent gene, as it occurs in all four biclusters (see (b)).
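The frequency analysis illustrated in Fig 10 amounts to counting how many biclusters of a class contain each signed gene and keeping the genes with the highest count. The sketch below assumes each bicluster is available as a set of signed gene labels; the function name and the toy biclusters are hypothetical and do not reproduce the actual bicluster contents from the paper.

```python
from collections import Counter


def top_frequent_items(biclusters):
    """Return (items, count) for the items occurring in the most biclusters.

    biclusters: iterable of sets of signed gene labels for one class.
    """
    # Count each item at most once per bicluster.
    counts = Counter(item for bicluster in biclusters for item in set(bicluster))
    if not counts:
        return [], 0
    best = max(counts.values())
    return sorted(item for item, c in counts.items() if c == best), best


# Hypothetical biclusters for one class label.
biclusters = [
    {"G1-", "G2-", "G3+"},
    {"G1-", "G2-"},
    {"G1-", "G2-", "G4+"},
    {"G1-", "G2-", "G5-"},
]
top_items, count = top_frequent_items(biclusters)
print(top_items, count)
```

Items that reach a count equal to the number of biclusters are exactly those lying in the intersection of all of them, which is the situation depicted for the four biclusters in each panel.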

**Fig 11. Comparison of number of significant itemsets between *StatBicRM* and other existing ARM methods at different minimum support for the two artificial datasets (viz., *ArDS*5 and *ArDS*6).**
“Significant itemset” refers to *MFCHOI* for *StatBicRM*, and FI for the other methods.

Cited by

Molecular signatures identified by integrating gene expression and methylation in non-seminoma and seminoma of testicular germ cell tumours.
Mallik S, Qin G, Jia P, Zhao Z. Epigenetics. 2021 Jan-Feb;16(2):162-176. doi: 10.1080/15592294.2020.1790108. Epub 2020 Jul 13. PMID: 32615059. Free PMC article.
Coordinated medical care for children with neurofibromatosis type 1 and related RASopathies in Poland.
Karwacki MW, Wysocki M, Perek-Polnik M, Jatczak-Gaca A. Arch Med Sci. 2019 May 17;17(5):1221-1231. doi: 10.5114/aoms.2019.85143. eCollection 2021. PMID: 34522251. Free PMC article.
Detecting TF-miRNA-gene network based modules for 5hmC and 5mC brain samples: a intra- and inter-species case-study between human and rhesus.
Maulik U, Sen S, Mallik S, Bandyopadhyay S. BMC Genet. 2018 Jan 22;19(1):9. doi: 10.1186/s12863-017-0574-7. PMID: 29357837. Free PMC article.
Optimal ranking and directional signature classification using the integral strategy of multi-objective optimization-based association rule mining of multi-omics data.
Mallik S, Seth S, Si A, Bhadra T, Zhao Z. Front Bioinform. 2023 Jul 27;3:1182176. doi: 10.3389/fbinf.2023.1182176. eCollection 2023. PMID: 37576714. Free PMC article.
3PNMF-MKL: A non-negative matrix factorization-based multiple kernel learning method for multi-modal data integration and its application to gene signature detection.
Mallik S, Sarkar A, Nath S, Maulik U, Das S, Pati SK, Ghosh S, Zhao Z. Front Genet. 2023 Feb 14;14:1095330. doi: 10.3389/fgene.2023.1095330. eCollection 2023. PMID: 36865387. Free PMC article.
