BMC Bioinformatics. 2008 Feb 11;9:92.

doi: 10.1186/1471-2105-9-92.

Microarray data mining using landmark gene-guided clustering

Pankaj Chopra¹, Jaewoo Kang, Jiong Yang, HyungJun Cho, Heenam Stanley Kim, Min-Goo Lee

PMID: 18267003
PMCID: PMC2262871
DOI: 10.1186/1471-2105-9-92

Pankaj Chopra et al. BMC Bioinformatics. 2008.

Affiliation

¹ Dept. of Computer Science and Engineering, Korea University, Seoul, Korea. pchopra@ncsu.edu

Abstract

Background: Clustering is a popular data exploration technique widely used in microarray data analysis. Most conventional clustering algorithms, however, generate only one set of clusters, regardless of the biological context of the analysis. This is often inadequate for exploring data from different biological perspectives and gaining new insights. We propose a new clustering model that can generate multiple different sets of clusters from a single dataset, each of which highlights a different aspect of that dataset.

Results: By applying our SigCalc algorithm to three yeast (Saccharomyces cerevisiae) datasets, we show two results. First, different sets of clusters can be generated from the same dataset by using different sets of landmark genes. Each set of clusters groups genes differently and reveals new biological associations between genes that were not apparent from clustering the original microarray expression data. Second, many of these newfound biological associations are common across datasets. These results also provide strong evidence of a link between the choice of landmark genes and the new biological associations found in gene clusters.

Conclusion: We have used the SigCalc algorithm to project microarray data onto a new subspace whose coordinates are genes (called landmark genes) known to belong to a Biological Process. The projected space is not a true vector space in the mathematical sense; we use the term subspace to refer to any one of the virtually unlimited number of projected spaces that the proposed method can produce. Changing the biological process, and thus the landmark genes, changes this subspace. We have shown that clustering on this subspace reveals new, biologically meaningful clusters that were not evident in the clusters generated by conventional methods. The R scripts (source code) are freely available under the GPL license as additional material [see Additional File 1], and the latest version can be obtained at http://www4.ncsu.edu/~pchopra/landmarks.html. The code is under active development to incorporate new clustering methods and analyses.
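The abstract does not spell out how SigCalc computes a gene signature, so the following is only a minimal sketch under an assumed definition: each gene's signature is taken to be the vector of Pearson correlations between its expression profile and the profile of each landmark gene, and the signatures are then clustered with a plain k-means. The function names (`gene_signatures`, `kmeans`), the choice of Pearson correlation, the farthest-point initialization, and the toy expression values are all hypothetical illustrations, not the authors' R implementation.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def gene_signatures(expression, landmarks):
    """ASSUMED signature: one coordinate per landmark gene, holding the
    correlation between the gene's profile and that landmark's profile."""
    return {gene: [pearson(profile, expression[lm]) for lm in landmarks]
            for gene, profile in expression.items()}

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=20):
    """Plain k-means with deterministic farthest-point initialization."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(farthest))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Move each centroid to the mean of its members (empty clusters stay put).
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy expression matrix: gene -> expression values across four conditions (made up).
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.1],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [1.0, 3.0, 2.0, 4.0],
    "g5": [8.0, 6.1, 4.0, 2.0],
    "g6": [2.0, 1.0, 4.0, 3.0],
}
genes = list(expression)

# Two hypothetical landmark sets, standing in for two biological processes.
signatures_a = gene_signatures(expression, ["g1", "g4"])
signatures_b = gene_signatures(expression, ["g4", "g6"])

labels_a = kmeans([signatures_a[g] for g in genes], k=2)
labels_b = kmeans([signatures_b[g] for g in genes], k=2)

print(dict(zip(genes, labels_a)))
print(dict(zip(genes, labels_b)))
```

Each landmark set defines its own coordinate system, so the same genes receive different signatures and may be grouped differently, which is the behaviour the abstract describes. In practice the landmark genes would be drawn from a Gene Ontology Biological Process rather than chosen by hand.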

Figures

**Figure 1**
**Comparison of microarray expression data with gene signatures for genes that clustered together using gene signatures**. Gasch dataset: Genes associated with *multi-organism process* (GO:0051704) were clustered together.

**Figure 2**
**Comparison of microarray expression data with gene signatures for genes that clustered together using gene signatures**. Gasch dataset: Genes associated with *reproduction* (GO:0000003) were clustered together.

**Figure 3**
**Number of GO terms for varying number of clusters**. For each landmark, a number of unique GO terms are found irrespective of the number of clusters.

**Figure 4**
**Comparison of unique GO terms found using gene signatures versus those found using semi-supervised clustering (SSC) for the Spellman and Gasch datasets**. For the semi-supervised clustering (SSC), the landmark genes were treated as 'must-link' constraints. *SSC1* denotes the number of unique GO terms found by using landmark genes as constraints in SSC. *GSM1* denotes the number of unique GO terms found by using the gene signature model. *SSC2* denotes the number of unique GO terms found for SSC when the largest cluster (containing all the landmark genes) is removed from the analysis. *GSM2* denotes the number of unique GO terms found using the gene signature model when the largest cluster is removed from the analysis. The results for other landmarks are shown in Figure 3 in Additional File 2.

**Figure 5**
Comparison of gene expression patterns in the largest cluster of semi-supervised clustering (SSC) versus the gene signature model (GSM) for the Gasch dataset using landmark genes associated with 'proteolysis'.

**Figure 6**
**Microarray expression data matrix**. The selected landmark genes are highlighted.

**Figure 7**
**Gene signatures derived from microarray data using SigCalc**. Gene signature matrix, where each row represents a gene signature.

**Figure 8**
**Significant GO terms in microarray data**. The dots indicate significant GO terms found by clustering the microarray data (i.e., *original GO terms*).

**Figure 9**
**Significant GO terms in microarray data and in gene signatures**. Comparison of the significant GO terms found by clustering gene signatures (i.e., *landmark GO terms*) with the *original GO terms*.

Cited by

Analyzing miRNA co-expression networks to explore TF-miRNA regulation.
Bandyopadhyay S, Bhattacharyya M. BMC Bioinformatics. 2009 May 28;10:163. doi: 10.1186/1471-2105-10-163. PMID: 19476620.
Fuzzy c-means clustering with prior biological knowledge.
Tari L, Baral C, Kim S. J Biomed Inform. 2009 Feb;42(1):74-81. doi: 10.1016/j.jbi.2008.05.009. Epub 2008 May 24. PMID: 18595779.
Semi-supervised clustering methods.
Bair E. Wiley Interdiscip Rev Comput Stat. 2013;5(5):349-361. doi: 10.1002/wics.1270. PMID: 24729830.
Semi-supervised consensus clustering for gene expression data analysis.
Wang Y, Pan Y. BioData Min. 2014 May 8;7:7. doi: 10.1186/1756-0381-7-7. eCollection 2014. PMID: 24920961.
Improving cancer classification accuracy using gene pairs.
Chopra P, Lee J, Kang J, Lee S. PLoS One. 2010 Dec 21;5(12):e14305. doi: 10.1371/journal.pone.0014305. PMID: 21200431.
