Clustering of gene expression data: performance and similarity analysis

doi:10.1186/1471-2105-7-S4-S19

Comparative Study

BMC Bioinformatics. 2006 Dec 12;7 Suppl 4(Suppl 4):S19.

Longde Yin¹, Chun-Hsi Huang, Jun Ni

PMID: 17217511
PMCID: PMC1780119
DOI: 10.1186/1471-2105-7-S4-S19

Longde Yin et al. BMC Bioinformatics. 2006.

Affiliation

¹ Department of Computer Science & Engineering, University of Connecticut, Storrs, CT 06269, USA. yin@engr.uconn.edu

Abstract

Background: DNA microarray technology is an innovative methodology in experimental molecular biology that has produced huge amounts of valuable gene expression profile data. Many clustering algorithms have been proposed to analyze gene expression data, but little guidance is available to help choose among them. Evaluating which clustering algorithms are feasible and applicable is therefore becoming an important issue in bioinformatics research.

Results: In this paper we first experimentally study three major clustering algorithms, Hierarchical Clustering (HC), Self-Organizing Map (SOM), and Self-Organizing Tree Algorithm (SOTA), using yeast (Saccharomyces cerevisiae) gene expression data, and compare their performance. We then introduce Cluster Diff, a new data mining tool, to analyze the similarity of clusters generated by different algorithms. The performance study shows that SOTA is more efficient than SOM, while HC is the least efficient. The similarity analysis shows that, given a target cluster, Cluster Diff can efficiently determine the closest match from a set of clusters. It is therefore an effective approach for evaluating different clustering algorithms.
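The closest-match idea described above can be illustrated with a minimal sketch. The abstract does not state how Cluster Diff scores similarity, so the Jaccard index on gene membership used here is purely an assumption, and the function names and gene identifiers are hypothetical placeholders.

```python
from typing import Dict, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index |a & b| / |a | b|; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def closest_match(target: Set[str], candidates: Dict[str, Set[str]]) -> Tuple[str, float]:
    """Return (name, score) of the candidate cluster most similar to the target.

    Assumption: similarity is gene-set overlap, not Cluster Diff's actual score.
    """
    best = max(candidates, key=lambda name: jaccard(target, candidates[name]))
    return best, jaccard(target, candidates[best])


# Hypothetical toy clusters; the gene identifiers are placeholders only.
target_cluster = {"g1", "g2", "g3", "g4"}
other_clusters = {
    "B0": {"g1", "g2", "g3", "g9"},  # shares 3 of 5 distinct genes with the target
    "B1": {"g4", "g5", "g6"},        # shares 1 of 6
    "B2": {"g7", "g8"},              # shares none
}

name, score = closest_match(target_cluster, other_clusters)
print(name, round(score, 2))
```

In this toy case the target is matched to the candidate with which it shares the most genes relative to their combined size. A profile-based score would be an equally reasonable choice.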

Conclusion: HC methods allow a convenient visual representation of genes. However, they are neither robust nor efficient. The SOM is more robust against noise, but it has the disadvantage that the number of clusters has to be fixed beforehand. SOTA combines the advantages of both hierarchical and SOM clustering: it allows a visual representation of the clusters and their structure and is not sensitive to noise. SOTA is also more flexible than the other two clustering methods. With our data mining tool, Cluster Diff, it is possible to analyze the similarity of clusters generated by different algorithms and thereby compare different clustering methods.
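To make the point about SOM concrete, the following is a minimal sketch of a one-dimensional self-organizing map in plain Python. It is not the implementation used in the study: the map layout, the linear decay schedules, and the names `train_som` and `best_node` are assumptions chosen for brevity. Note that `n_nodes`, the number of clusters, has to be passed in before training starts.

```python
import math
import random
from typing import List


def best_node(nodes: List[List[float]], x: List[float]) -> int:
    """Index of the node whose weights are closest (squared Euclidean) to x."""
    return min(range(len(nodes)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(nodes[j], x)))


def train_som(data: List[List[float]], n_nodes: int,
              epochs: int = 50, seed: int = 0) -> List[List[float]]:
    """Train a 1-D SOM; n_nodes (the number of clusters) is fixed up front."""
    rng = random.Random(seed)
    # Initialise each node's weight vector from a randomly chosen data point.
    nodes = [list(rng.choice(data)) for _ in range(n_nodes)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs
        lr = 0.5 * frac                           # learning rate decays linearly
        radius = max(n_nodes / 2.0 * frac, 0.5)   # neighbourhood shrinks over time
        for x in rng.sample(data, len(data)):     # visit profiles in random order
            bmu = best_node(nodes, x)
            for j, w in enumerate(nodes):
                # Nodes near the best-matching unit on the map move more.
                h = math.exp(-((j - bmu) ** 2) / (2.0 * radius ** 2))
                for d in range(len(w)):
                    w[d] += lr * h * (x[d] - w[d])
    return nodes


# Hypothetical toy expression profiles (two rising, two falling).
profiles = [[0.0, 0.5, 1.0], [0.1, 0.6, 1.1], [1.0, 0.5, 0.0], [1.1, 0.4, -0.1]]
som_nodes = train_som(profiles, n_nodes=2)
labels = [best_node(som_nodes, p) for p in profiles]
print(labels)
```

Each profile is assigned to its best-matching node, so the map can never produce more clusters than the `n_nodes` chosen in advance.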

Figures

**Figure 1**
**Runtimes for SOTA and hierarchical clustering**. For a large number of genes (>1000), SOTA is clearly faster than HC. For a relatively small number of genes (<1000), however, SOTA and HC perform similarly.
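The caption does not say why HC slows down on larger inputs. One commonly cited reason is that standard agglomerative clustering compares every pair of genes, and the number of pairs grows quadratically. The sketch below is a naive average-linkage implementation written for illustration only; it is not the code benchmarked in the figure, and the function names and toy profiles are assumptions.

```python
from itertools import combinations
from typing import List


def euclidean(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def pair_count(n: int) -> int:
    """Number of distinct gene pairs among n genes: n(n-1)/2."""
    return n * (n - 1) // 2


def average_linkage(profiles: List[List[float]], k: int) -> List[List[int]]:
    """Naive agglomerative clustering: merge the closest pair until k clusters remain."""
    clusters = [[i] for i in range(len(profiles))]

    def linkage(pair):
        # Average distance over all cross-cluster gene pairs.
        a, b = pair
        dists = [euclidean(profiles[i], profiles[j])
                 for i in clusters[a] for j in clusters[b]]
        return sum(dists) / len(dists)

    while len(clusters) > k:
        a, b = min(combinations(range(len(clusters)), 2), key=linkage)
        clusters[a] = clusters[a] + clusters[b]   # a < b, so deleting b is safe
        del clusters[b]
    return clusters


# Hypothetical toy profiles (two rising, two falling).
toy = [[0.0, 0.5, 1.0], [0.1, 0.6, 1.1], [1.0, 0.5, 0.0], [1.1, 0.4, -0.1]]
hc_result = average_linkage(toy, k=2)
print(hc_result)
print(pair_count(1000), pair_count(2000))
```

Doubling the number of genes from 1000 to 2000 roughly quadruples the number of pairwise distances, which is consistent with HC falling behind as the gene count grows.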

**Figure 2**
**Runtimes for SOM and SOTA**. The runtimes of SOTA and SOM are proportional to the sample size, and SOTA is faster than SOM.

**Figure 3**
**Clustering result of SOTA**. The size of each circle is proportional to the number of genes in that cluster. The patterns of the clusters appear to the right of the circles.

**Figure 4**
**Clustering results of SOM**. Each rectangle corresponds to a node of the map. The thick black line in each rectangle is the profile of the node, and the grey lines are the profiles of the genes in that cluster. The black bars to the left of the profiles are proportional to the number of genes in the clusters.

**Figure 5**
**Screenshot of the *Cluster Diff* window**. The main window contains the file, view, and help buttons. In this figure, the left group (A) has 6 clusters, A0 to A5, and the right group (B) has 8 clusters, B0 to B7. In each cluster, each column represents a dimension of the microarray data and each row represents a gene's profile. The score measures similarity.

**Figure 6**
**Screenshot of cluster similarity analysis**. The similarity analysis results for clusters generated by SOTA and SOM. Matched clusters are linked by grey lines.

**Figure 7**
**Example of well-matched clusters**. The profiles of these two clusters have similar trends, meaning that most genes in the two clusters are similar.

**Figure 8**
**Example of poorly matched clusters**. The two clusters are mismatched: their trends differ.
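The contrast between well-matched and poorly matched clusters can be sketched by comparing the trends of their mean profiles. Pearson correlation is used here only as an assumed stand-in for "similar trends"; the captions do not specify the actual measure, and the toy clusters and function names are hypothetical.

```python
from statistics import mean
from typing import List


def mean_profile(cluster: List[List[float]]) -> List[float]:
    """Average expression value at each time point or condition across the cluster."""
    return [mean(col) for col in zip(*cluster)]


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation; undefined (division by zero) for a flat profile."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


# Hypothetical toy clusters: rows are genes, columns are conditions.
cluster_a = [[0.0, 1.0, 2.0], [0.0, 2.0, 4.0]]   # rising trend
cluster_b = [[1.0, 2.0, 3.0]]                    # rising trend
cluster_c = [[3.0, 2.0, 1.0]]                    # falling trend

good = pearson(mean_profile(cluster_a), mean_profile(cluster_b))
bad = pearson(mean_profile(cluster_a), mean_profile(cluster_c))
print(round(good, 2), round(bad, 2))
```

A correlation near 1 corresponds to the similar trends of a good match, while a value near -1 corresponds to opposing trends.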

Cited by

Tumor Necrosis Factor Alpha and Insulin-Like Growth Factor 1 Induced Modifications of the Gene Expression Kinetics of Differentiating Skeletal Muscle Cells.
Meyer SU, Krebs S, Thirion C, Blum H, Krause S, Pfaffl MW. PLoS One. 2015 Oct 8;10(10):e0139520. doi: 10.1371/journal.pone.0139520. eCollection 2015. PMID: 26447881.
Integrative Analysis of MicroRNA and mRNA Data Reveals an Orchestrated Function of MicroRNAs in Skeletal Myocyte Differentiation in Response to TNF-α or IGF1.
Meyer SU, Sass S, Mueller NS, Krebs S, Bauersachs S, Kaiser S, Blum H, Thirion C, Krause S, Theis FJ, Pfaffl MW. PLoS One. 2015 Aug 13;10(8):e0135284. doi: 10.1371/journal.pone.0135284. eCollection 2015. PMID: 26270642.
Reconstruct modular phenotype-specific gene networks by knowledge-driven matrix factorization.
Yang X, Zhou Y, Jin R, Chan C. Bioinformatics. 2009 Sep 1;25(17):2236-43. doi: 10.1093/bioinformatics/btp376. Epub 2009 Jun 19. PMID: 19542155.
Development of computations in bioscience and bioinformatics and its application: review of the Symposium of Computations in Bioinformatics and Bioscience (SCBB06).
Deng Y, Ni J, Zhang C. BMC Bioinformatics. 2006 Dec 12;7 Suppl 4(Suppl 4):S1. doi: 10.1186/1471-2105-7-S4-S1. PMID: 17217501.
Serum microRNA as a potential biomarker for the activity of thyroid eye disease.
Kim N, Choung H, Kim YJ, Woo SE, Yang MK, Khwarg SI, Lee MJ. Sci Rep. 2023 Jan 5;13(1):234. doi: 10.1038/s41598-023-27483-w. PMID: 36604580.
