Comparative Study
BMC Bioinformatics. 2006 Dec 12;7 Suppl 4(Suppl 4):S19.
doi: 10.1186/1471-2105-7-S4-S19.

Clustering of gene expression data: performance and similarity analysis

Longde Yin et al. BMC Bioinformatics.

Abstract

Background: DNA microarray technology is an innovative methodology in experimental molecular biology that has produced huge amounts of valuable gene expression profile data. Many clustering algorithms have been proposed to analyze gene expression data, but little guidance is available to help choose among them. Evaluating which clustering algorithms are feasible and applicable is therefore becoming an important issue in bioinformatics research.

Results: In this paper we first experimentally study three major clustering algorithms, Hierarchical Clustering (HC), Self-Organizing Map (SOM), and Self-Organizing Tree Algorithm (SOTA), using yeast (Saccharomyces cerevisiae) gene expression data, and compare their performance. We then introduce Cluster Diff, a new data mining tool, to analyze the similarity of clusters generated by different algorithms. The performance study shows that SOTA is more efficient than SOM, while HC is the least efficient. The similarity analysis shows that, given a target cluster, Cluster Diff can efficiently determine the closest match from a set of clusters. It is therefore an effective approach for evaluating different clustering algorithms.
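The closest-match idea described above can be illustrated with a minimal sketch. The abstract does not state how Cluster Diff computes its similarity score, so the Jaccard overlap of gene membership used here is an assumption chosen for illustration, not the tool's actual measure. The function names, cluster labels, and gene lists are likewise hypothetical.

```python
from typing import Dict, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two gene sets: |a & b| / |a | b|.

    Assumption: this stands in for Cluster Diff's score, which the
    abstract does not define.
    """
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def closest_match(target: Set[str],
                  candidates: Dict[str, Set[str]]) -> Tuple[str, float]:
    """Return (label, score) of the candidate cluster most similar to target."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    best = max(candidates, key=lambda label: jaccard(target, candidates[label]))
    return best, jaccard(target, candidates[best])


# Illustrative data only: one cluster from one algorithm (the target)
# and three clusters from another algorithm (the candidates).
target_cluster = {"YAL001C", "YAL002W", "YAL003W", "YAL005C"}
other_clusters = {
    "B0": {"YAL001C", "YAL002W", "YAL003W"},
    "B1": {"YAL005C", "YAL007C"},
    "B2": {"YAL008W"},
}

label, score = closest_match(target_cluster, other_clusters)
print(label, score)  # B0 0.75
```

Repeating this for every cluster of one algorithm against all clusters of another gives a pairwise matching of the kind the similarity analysis reports.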

Conclusion: HC methods allow a convenient visual representation of genes. However, they are neither robust nor efficient. SOM is more robust against noise, but it has the disadvantage that the number of clusters must be fixed beforehand. SOTA combines the advantages of both hierarchical and SOM clustering: it allows a visual representation of the clusters and their structure, and it is not sensitive to noise. SOTA is also more flexible than the other two clustering methods. With our data mining tool, Cluster Diff, it is possible to analyze the similarity of clusters generated by different algorithms and thereby compare different clustering methods.
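The point that SOM needs its number of clusters fixed beforehand can be seen in a minimal one-dimensional SOM sketch. This is a generic textbook-style version written for illustration, not the implementation used in the study. The parameter names (`n_nodes`, `epochs`, `lr0`), the linear decay schedules, and the toy expression profiles are all assumptions.

```python
import math
import random
from typing import List


def best_node(nodes: List[List[float]], x: List[float]) -> int:
    """Index of the node whose weight vector is closest to profile x."""
    return min(range(len(nodes)),
               key=lambda j: sum((w - v) ** 2 for w, v in zip(nodes[j], x)))


def train_som(profiles: List[List[float]], n_nodes: int,
              epochs: int = 50, lr0: float = 0.5, seed: int = 0) -> List[List[float]]:
    """Train a 1-D SOM; n_nodes (the cluster count) must be chosen up front."""
    rng = random.Random(seed)
    dim = len(profiles[0])
    # Initialise each node as a copy of a randomly chosen profile.
    nodes = [list(rng.choice(profiles)) for _ in range(n_nodes)]
    sigma0 = max(n_nodes / 2.0, 1.0)
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                   # learning rate decays linearly
        sigma = max(sigma0 * (1.0 - frac), 0.5)   # neighbourhood width shrinks
        for x in rng.sample(profiles, len(profiles)):
            bmu = best_node(nodes, x)
            for j, w in enumerate(nodes):
                # Gaussian neighbourhood on the 1-D map around the best node.
                h = math.exp(-((j - bmu) ** 2) / (2.0 * sigma ** 2))
                for d in range(dim):
                    w[d] += lr * h * (x[d] - w[d])
    return nodes


# Toy expression profiles (illustrative values, not data from the paper).
profiles = [
    [0.0, 1.0, 2.0], [0.1, 1.1, 2.1], [0.2, 0.9, 1.9],
    [2.0, 1.0, 0.0], [2.1, 0.9, 0.1], [1.9, 1.1, 0.2],
]
nodes = train_som(profiles, n_nodes=3)
labels = [best_node(nodes, p) for p in profiles]
print(labels)
```

Every profile is assigned to one of the `n_nodes` map nodes, so the caller has to decide the cluster count before training. A tree-based method such as HC or SOTA does not impose that choice up front.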

Figures

Figure 1
Runtimes for SOTA and hierarchical clustering (HC). For a large number of genes (>1000), SOTA is clearly faster than HC. For a relatively small number of genes (<1000), however, the performance of SOTA and HC is similar.
Figure 2
Runtimes for SOM and SOTA. The runtimes of SOTA and SOM are proportional to the sample size, and SOTA is faster than SOM.
Figure 3
Clustering result of SOTA. The size of each circle is proportional to the number of genes in that cluster. The cluster patterns appear to the right of the circles.
Figure 4
Clustering results of SOM. Each rectangle corresponds to a node of the map. The thick black line in the rectangle corresponds to the profile of the node, and the grey lines correspond to the profiles of the genes in that cluster. The black bars to the left of the profiles are proportional to the number of genes in the clusters.
Figure 5
Screenshot of the Cluster Diff window. The main window contains the file, view, and help buttons. In this figure, the left group (A) has 6 clusters, A0 to A5, and the right group (B) has 8 clusters, B0 to B7. In each cluster, each column represents a dimension of the microarray data and each row represents a gene's profile. The score measures similarity.
Figure 6
Screenshot of cluster similarity analysis. Similarity analysis results for clusters generated by SOTA and SOM. Matched parts are linked by grey lines.
Figure 7
Example of well-matched clusters. The profiles of these two clusters have similar trends, meaning that most genes in the two clusters are similar.
Figure 8
Example of poorly matched clusters. The two clusters are mismatched: their trends are different.
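The contrast between similar and different trends in Figures 7 and 8 can be made concrete with a short sketch. It compares the mean expression profiles of two clusters using Pearson correlation. This is an assumed, illustrative measure: the captions do not say how Cluster Diff quantifies trend similarity. The helper names and the toy profiles are hypothetical.

```python
import math
from typing import List, Sequence


def mean_profile(cluster: Sequence[Sequence[float]]) -> List[float]:
    """Column-wise mean of the gene profiles in a cluster."""
    n = len(cluster)
    return [sum(column) / n for column in zip(*cluster)]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length profiles (0.0 if one is flat)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        return 0.0
    return cov / (sx * sy)


# Toy clusters (illustrative values): rows are genes, columns are conditions.
rising_a = [[0.0, 1.0, 2.0, 3.0], [0.2, 1.1, 2.3, 2.9]]
rising_b = [[0.1, 0.9, 2.1, 3.2]]
falling = [[3.0, 2.0, 1.0, 0.0]]

good = pearson(mean_profile(rising_a), mean_profile(rising_b))
bad = pearson(mean_profile(rising_a), mean_profile(falling))
print(good > 0.9, bad < -0.9)  # True True
```

Under this assumed measure, a correlation near 1 corresponds to a good match like the one in Figure 7, and a low or negative value corresponds to a mismatch like the one in Figure 8.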
