BMC Bioinformatics. 2008 Feb 11;9:92.
doi: 10.1186/1471-2105-9-92.

Microarray data mining using landmark gene-guided clustering

Pankaj Chopra et al. BMC Bioinformatics.

Abstract

Background: Clustering is a popular data exploration technique widely used in microarray data analysis. Most conventional clustering algorithms, however, generate only one set of clusters independent of the biological context of the analysis. This is often inadequate to explore data from different biological perspectives and gain new insights. We propose a new clustering model that can generate multiple versions of different clusters from a single dataset, each of which highlights a different aspect of the given dataset.

Results: By applying our SigCalc algorithm to three yeast Saccharomyces cerevisiae datasets, we show two results. First, we show that different sets of clusters can be generated from the same dataset using different sets of landmark genes. Each set of clusters groups genes differently and reveals new biological associations between genes that were not apparent from clustering the original microarray expression data. Second, we show that many of these newly found biological associations are common across datasets. These results also provide strong evidence of a link between the choice of landmark genes and the new biological associations found in gene clusters.

Conclusion: We have used the SigCalc algorithm to project the microarray data onto a completely new subspace whose coordinates are genes (called landmark genes) known to belong to a given biological process. The projected space is not a true vector space in mathematical terms. However, we use the term subspace to refer to one of the virtually unlimited number of projected spaces that our proposed method can produce. By changing the biological process, and thus the landmark genes, we can change this subspace. We have shown how clustering on this subspace reveals new, biologically meaningful clusters that were not evident in the clusters generated by conventional methods. The R scripts (source code) are freely available under the GPL. The source code is available [see Additional File 1] as additional material, and the latest version can be obtained at http://www4.ncsu.edu/~pchopra/landmarks.html. The code is under active development to incorporate new clustering methods and analysis.
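
The idea of re-expressing each gene relative to a set of landmark genes and then clustering the result can be illustrated with a small Python sketch. This is not the authors' SigCalc implementation (theirs is in R, and the abstract does not state how signatures are computed). The sketch assumes, purely for illustration, that a gene's signature is its vector of Pearson correlations with the landmark genes. The function names `gene_signatures` and `kmeans`, the random toy data, and the two landmark index sets are all hypothetical.

```python
import numpy as np


def gene_signatures(expr, landmark_idx):
    """Return a (genes x landmarks) matrix of Pearson correlations.

    ASSUMPTION: correlation with each landmark gene stands in for the
    signature calculation; the abstract does not specify SigCalc's formula.
    """
    expr = np.asarray(expr, dtype=float)
    # Center each gene's expression profile across conditions.
    centered = expr - expr.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # constant profiles get zero correlation
    unit = centered / norms
    # Dot products of unit-length centered profiles are Pearson correlations.
    return unit @ unit[landmark_idx].T


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means, used here only as a generic clustering step."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels


# Toy expression matrix: 20 genes x 8 conditions (random, hypothetical data).
expr = np.random.default_rng(1).normal(size=(20, 8))

# Two hypothetical landmark sets, standing in for two biological processes.
landmarks_a = [0, 1, 2]
landmarks_b = [10, 11, 12]

sig_a = gene_signatures(expr, landmarks_a)
sig_b = gene_signatures(expr, landmarks_b)

# Clustering each signature matrix gives one grouping per landmark set.
labels_a = kmeans(sig_a, k=3)
labels_b = kmeans(sig_b, k=3)
```

Each landmark set yields its own signature matrix, so the same expression data can be clustered from more than one biological perspective, which is the behaviour the abstract describes.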

Figures

Figure 1
Comparison of microarray expression data with gene signatures for genes that clustered together using gene signatures. Gasch dataset: Genes associated with multi-organism process (GO:0051704) were clustered together.
Figure 2
Comparison of microarray expression data with gene signatures for genes that clustered together using gene signatures. Gasch dataset: Genes associated with reproduction (GO:0000003) were clustered together.
Figure 3
Number of GO terms for varying number of clusters. For each landmark, a number of unique GO terms are found irrespective of the number of clusters.
Figure 4
Comparison of unique GO terms found using gene signatures versus those found using semi-supervized clustering (SSC) for the Spellman and Gasch datasets. For the semi-supervized clustering (SSC), the landmark genes were considered as 'must-link' constraints. SSC1 denotes the number of unique GO terms found by using landmark genes as constraints in SSC. GSM1 denotes the number of unique GO terms found by using the gene signature model. SSC2 denotes the number of unique GO terms found for SSC if we remove the largest cluster (containing all the landmark genes) from analysis. GSM2 denotes the number of unique GO terms found using the gene signature model if we remove the largest cluster from analysis. The results for other landmarks are shown in Figure 3 in Additional File 2.
Figure 5
Comparison of gene expression patterns in the largest cluster of semi-supervized clustering (SSC) versus the gene signature model (GSM) for the Gasch dataset using landmark genes associated with 'proteolysis'.
Figure 6
Microarray expression data matrix. The selected landmark genes are highlighted.
Figure 7
Gene signatures derived from microarray data using SigCalc. Gene signature matrix, where each row represents a gene signature.
Figure 8
Significant GO terms in microarray data. The dots indicate Significant GO terms found by performing clustering on microarray data (i.e., original GO terms).
Figure 9
Significant GO terms in microarray data and in gene signatures. Shows a comparison of Significant GO terms found by clustering gene signatures (i.e., landmark GO terms) with the original GO terms.
