Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2015 Apr 1;10(4):e0119448.
doi: 10.1371/journal.pone.0119448. eCollection 2015.

Analyzing large gene expression and methylation data profiles using StatBicRM: statistical biclustering-based rule mining

Affiliations

Analyzing large gene expression and methylation data profiles using StatBicRM: statistical biclustering-based rule mining

Ujjwal Maulik et al. PLoS One. .

Abstract

Microarray and beadchip are two most efficient techniques for measuring gene expression and methylation data in bioinformatics. Biclustering deals with the simultaneous clustering of genes and samples. In this article, we propose a computational rule mining framework, StatBicRM (i.e., statistical biclustering-based rule mining) to identify special type of rules and potential biomarkers using integrated approaches of statistical and binary inclusion-maximal biclustering techniques from the biological datasets. At first, a novel statistical strategy has been utilized to eliminate the insignificant/low-significant/redundant genes in such way that significance level must satisfy the data distribution property (viz., either normal distribution or non-normal distribution). The data is then discretized and post-discretized, consecutively. Thereafter, the biclustering technique is applied to identify maximal frequent closed homogeneous itemsets. Corresponding special type of rules are then extracted from the selected itemsets. Our proposed rule mining method performs better than the other rule mining algorithms as it generates maximal frequent closed homogeneous itemsets instead of frequent itemsets. Thus, it saves elapsed time, and can work on big dataset. Pathway and Gene Ontology analyses are conducted on the genes of the evolved rules using David database. Frequency analysis of the genes appearing in the evolved rules is performed to determine potential biomarkers. Furthermore, we also classify the data to know how much the evolved rules are able to describe accurately the remaining test (unknown) data. Subsequently, we also compare the average classification accuracy, and other related factors with other rule-based classifiers. Statistical significance tests are also performed for verifying the statistical relevance of the comparative results. Here, each of the other rule mining methods or rule-based classifiers is also starting with the same post-discretized data-matrix. Finally, we have also included the integrated analysis of gene expression and methylation for determining epigenetic effect (viz., effect of methylation) on gene expression level.

PubMed Disclaimer

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Fig 1
Fig 1. Flowchart of the proposed methodology (StatBicRM) for the rule mining.
Here, the terms TOTALDESET N, TOTALDESET NN, TOTALDESET N+NN are described in last paragraph of subsection “Identification of differentially expressed/methylated genes using Statistical tests”. For methylation dataset, the above terms are replaced by TOTALDMSET N, TOTALDMSET NN, TOTALDMSET N+NN, respectively.
Fig 2
Fig 2. Flowchart of the proposed methodology (StatBicRM) for the classification.
Here, the terms TOTALDESET N, TOTALDESET NN, TOTALDESET N+NN are described in last paragraph of subsection. For methylation dataset, the above terms are replaced by TOTALDMSET N, TOTALDMSET NN, TOTALDMSET N+NN, respectively.
Fig 3
Fig 3. An example of generating special rules from data matrix of the differentially expressed genes.
Here, up-regulation (i.e., ‘+’) and down-regulation (‘-’) are denoted by ‘1’ and ‘0’ in (b), and red and green colors in (c), respectively. Here, s tr and s nr denote experimental/diseased/treated and control/normal samples respectively.
Fig 4
Fig 4. An example of classification of evolved rules by the majority voting using weighted-sum.
Here, ‘r’ and ‘w’ denote rank and weight of the rule (computed by Equation 18), respectively. Tickmark/crossmark in ‘Q’ column states that test-point (ts) is satisfied/non-satisfied by the corresponding rule.
Fig 5
Fig 5. The clustergram of the common differentially expressed genes (by different statistical tests) for DS1.
Here, red colour denotes up-regulation of genes across the specific samples/conditions, and green colour denotes down-regulation of genes across the specific samples/conditions.
Fig 6
Fig 6. Volcanoplot for identifying differential up and down-regulated genes from Dataset 1 by SAM.
Fig 7
Fig 7. A graphical representation of the gene expression of a maximal homogeneous bicluster (i.e., a MFCHOI) over different samples.
Fig 8
Fig 8. Barcharts: (a) comparison of dataset-wise average accuracies, and (b) comparison of dataset-wise average MCCs, among our proposed and other existing rule-based classifiers for the four datasets.
Fig 9
Fig 9. Boxplots of significance tests (i.e., one-way Anova) for identifying level of significances (i.e., p-values) of accuracies between the proposed and other rule-based classifiers (pairwise) for Dataset 1 [in (a).(i-vi)], Dataset 2 [in (b).(i-vi)], Dataset 3 [in (c).(i-vi)] and Dataset 4 [in (d).(i-vi)]; where (i) proposed vs ConjunctiveRule, (ii) proposed vs DecisionTable, (iii) proposed vs JRip, (iv) proposed vs OneR, (v) proposed vs PART and (vi) proposed vs Ridor; (here vertical axis denotes the accuracy of the classifier).
Fig 10
Fig 10. Two examples of how significant biomarkers are identified from the maximal homogeneous biclusters (i.e., MFCHOI) for each class-label for each dataset.
Here, we are shown intersection of only four maximal homogeneous biclusters for (a) the class-label AC and (b) the class-label SCC, individually (for Dataset 1). For the class AC, CENPA-, TTK-, KIF11-, KIF18B- and ZNF367- are the top frequent genes as they exist in the four biclusters (see (a)); similarly, for the class SCC, SHROOM3- is top frequent gene as it exists in the four biclusters (see (b)).
Fig 11
Fig 11. Comparison of number of significant itemsets between StatBicRM and other existing ARM methods at different minimum support for the two artificial datasets (viz., ArDS5 and ArDS6).
“Significant itemset” refers to MFCHOI for StatBicRM, and FI for the other methods.

Similar articles

Cited by

References

    1. Bandyopadhyay S, Maulik U, Wang J. Analysis of Biological Data: A Soft Computing Approach World Scientific, Singapore; 2007.
    1. Maulik U. Analysis of gene microarray data in a soft computing framework. Applied Soft Computing 2011; 11: 4152–4160. 10.1016/j.asoc.2011.03.004 - DOI
    1. Maulik U, Bandyopadhyay S, Wang J. Computational Intelligence and Pattern Analysis in Biological Informatics. Wiley, Singapore; 2010.
    1. Mallik S, Mukhopadhyay A, Maulik U, Bandyopadhyay S. Integrated analysis gene expression and genome-wide DNA methylation for tumor prediction: An association rule mining-based approach In: Proceedings IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), IEEE Symposium Series on Computational Intelligence (SSCI), Singapore: 2013.
    1. Dudoit S, Yang Y, Speed T, Callow M. Statistical methods for identifying differentially expressed genes in replicated cdna microarray experiments. Statistica Sinica 2002; 12: 111–139.

LinkOut - more resources