Methods for evaluating clustering algorithms for gene expression data using a reference set of functional classes

BMC Bioinformatics. 2006 Aug 31;7:397. doi: 10.1186/1471-2105-7-397.

Susmita Datta¹, Somnath Datta

PMID: 16945146
PMCID: PMC1590054
DOI: 10.1186/1471-2105-7-397

Affiliation

¹ Department of Bioinformatics and Biostatistics, University of Louisville, Louisville, KY 40202, USA. susmita.datta@louisville.edu

Abstract

Background: Cluster analysis is the most commonly performed procedure (often regarded as a first step) on a set of gene expression profiles. In most cases, a post hoc analysis is then done to see whether the genes in the same cluster can be functionally correlated. Although successes of such analyses have been reported in a number of microarray studies (most of which used standard hierarchical clustering, UPGMA, with one minus Pearson's correlation coefficient as the measure of dissimilarity), such groupings can often be misleading. More importantly, a systematic evaluation of the entire set of clusters produced by such unsupervised procedures is necessary, since the clusters also contain genes that are seemingly unrelated or that may have more than one common function. Here we quantify the performance of a given unsupervised clustering algorithm applied to a given microarray study in terms of its ability to produce biologically meaningful clusters, using a reference set of functional classes. Such a reference set may come from prior biological knowledge specific to a microarray study or may be formed using the growing gene ontology (GO) databases for the annotated genes of the relevant species.

Results: In this paper, we introduce two performance measures for evaluating a clustering algorithm's ability to produce biologically meaningful clusters. The first measure is a biological homogeneity index (BHI). As the name suggests, it measures how biologically homogeneous the clusters are. It can be used to quantify the performance of a given clustering algorithm, such as UPGMA, in grouping genes for a particular data set, and also to compare the performance of several competing clustering algorithms applied to the same data set. The second performance measure is a biological stability index (BSI). For a given clustering algorithm and expression data set, it measures the consistency of the algorithm's ability to produce biologically meaningful clusters when applied repeatedly to similar data sets. A good clustering algorithm should have high BHI and moderate to high BSI. We evaluated the performance of ten well-known clustering algorithms on two gene expression data sets and identified the optimal algorithm in each case. The first data set deals with SAGE profiles of differentially expressed tags between normal and ductal carcinoma in situ (DCIS) samples of breast cancer patients. The second data set contains the expression profiles over time of positively expressed genes (ORFs) during sporulation of budding yeast. Two separate choices of functional classes were used for this data set, and the results were compared for consistency.
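The abstract does not spell out a formula for BHI, so the following is only a minimal sketch under stated assumptions: homogeneity is taken to be, for each cluster, the fraction of annotated gene pairs that share at least one functional class, averaged over clusters that contain annotated genes. The function names `bhi` and `random_bhi_percentile`, the dictionary input format, the treatment of unannotated genes, and the label-shuffling scheme used for the random-clustering reference are all illustrative choices, not the authors' implementation.

```python
from itertools import combinations
import random


def bhi(clusters, annotations):
    """Sketch of a biological homogeneity index (assumed pairwise form).

    clusters:    dict gene -> cluster label
    annotations: dict gene -> set of functional class names
    """
    # Group annotated genes by cluster; unannotated genes are skipped (assumption).
    members = {}
    for gene, label in clusters.items():
        if annotations.get(gene):
            members.setdefault(label, []).append(gene)

    scores = []
    for genes in members.values():
        pairs = list(combinations(genes, 2))
        if not pairs:
            # A cluster with a single annotated gene contributes 0 (assumption).
            scores.append(0.0)
            continue
        # Count pairs sharing at least one functional class.
        shared = sum(1 for a, b in pairs if annotations[a] & annotations[b])
        scores.append(shared / len(pairs))
    return sum(scores) / len(scores) if scores else 0.0


def random_bhi_percentile(clusters, annotations, n_perm=200, q=0.95, seed=0):
    """Empirical q-th quantile of BHI when cluster labels are shuffled.

    Shuffling keeps the cluster sizes fixed; this is one possible notion of
    "random clustering", assumed here for illustration.
    """
    rng = random.Random(seed)
    genes = list(clusters)
    labels = [clusters[g] for g in genes]
    values = []
    for _ in range(n_perm):
        rng.shuffle(labels)
        values.append(bhi(dict(zip(genes, labels)), annotations))
    values.sort()
    return values[min(len(values) - 1, int(q * len(values)))]


# Toy data (hypothetical genes and classes).
clusters = {"g1": 0, "g2": 0, "g3": 0, "g4": 1, "g5": 1}
annotations = {"g1": {"A"}, "g2": {"A", "B"}, "g3": {"C"}, "g4": {"B"}, "g5": {"B"}}

print(round(bhi(clusters, annotations), 4))  # 0.6667
print(random_bhi_percentile(clusters, annotations))
```

In the toy example, cluster 0 has one matching pair out of three and cluster 1 has one out of one, giving (1/3 + 1)/2. A BHI well above the shuffled-label quantile would suggest that the clustering is more functionally coherent than chance under these assumptions.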

Conclusion: Functional information on annotated genes, available from various GO databases and mined using ontology tools, can be used to systematically judge the results of an unsupervised clustering algorithm applied to a gene expression data set. This information could be used to select the right algorithm from a class of clustering algorithms for the given data set.

Figures

**Figure 1**
BHI for various clustering algorithms applied to the normal and DCIS samples in breast cancer data. The thick black line is the 95th percentile of BHI values under random clustering.

**Figure 2**
BSI for various clustering algorithms applied to the normal and DCIS samples in breast cancer data.

**Figure 3**
BHI for various clustering algorithms applied to the positively expressed genes in yeast sporulation data with functional classes from FatiGO. The thick black line is the 95th percentile of BHI values under random clustering.

**Figure 4**
BSI for various clustering algorithms applied to the positively expressed genes in yeast sporulation data with functional classes from FatiGO.

**Figure 5**
BHI for various clustering algorithms applied to the positively expressed genes in yeast sporulation data with functional classes from FunCat. The thick black line is the 95th percentile of BHI values under random clustering.

**Figure 6**
BSI for various clustering algorithms applied to the positively expressed genes in yeast sporulation data with functional classes from FunCat.
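The BSI shown in Figures 2, 4 and 6 is described in the abstract only as the consistency of biologically meaningful clustering across similar data sets. As a rough sketch of how such a quantity might be computed, the code below assumes that the clusterings of the perturbed data sets (for example, the data with one sample removed at a time) have already been obtained, and scores, for each pair of genes in the same functional class, how much the full-data cluster of one gene overlaps with the perturbed-data cluster of the other. The function name `bsi`, the input format, and this overlap formula are assumptions made for illustration.

```python
from itertools import permutations


def bsi(full, reduced_list, classes):
    """Sketch of a biological stability index (assumed overlap form).

    full:         dict gene -> cluster label from the complete data
    reduced_list: list of dicts gene -> cluster label, one per perturbed data set
    classes:      dict functional class name -> set of genes
    """
    def groups(assignment):
        # Invert gene -> label into label -> set of genes.
        out = {}
        for gene, label in assignment.items():
            out.setdefault(label, set()).add(gene)
        return out

    full_groups = groups(full)
    total, n_terms = 0.0, 0
    for reduced in reduced_list:
        reduced_groups = groups(reduced)
        for class_genes in classes.values():
            genes = [g for g in class_genes if g in full and g in reduced]
            if len(genes) < 2:
                continue  # classes with fewer than two clustered genes are skipped
            acc = 0.0
            for gx, gy in permutations(genes, 2):
                cx = full_groups[full[gx]]           # cluster of gx, full data
                cy = reduced_groups[reduced[gy]]     # cluster of gy, perturbed data
                acc += len(cx & cy) / len(cx)
            total += acc / (len(genes) * (len(genes) - 1))
            n_terms += 1
    return total / n_terms if n_terms else 0.0


# Toy data (hypothetical genes, clusterings and class).
full = {"a": 0, "b": 0, "c": 1, "d": 1}
same = dict(full)                          # perturbed clustering identical to full
moved = {"a": 0, "b": 1, "c": 1, "d": 0}   # perturbed clustering that splits a and b
classes = {"F": {"a", "b"}}

print(bsi(full, [same], classes))         # 1.0
print(bsi(full, [same, moved], classes))  # 0.75
```

Under this formulation, values near 1 indicate that functionally related genes stay together when the data are perturbed, and lower values indicate that their co-clustering depends on individual samples.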

Cited by

CellBIC: bimodality-based top-down clustering of single-cell RNA sequencing data reveals hierarchical structure of the cell type.
Kim J, Stanescu DE, Won KJ. Nucleic Acids Res. 2018 Nov 30;46(21):e124. doi: 10.1093/nar/gky698. PMID: 30102368.

Synergies of Radiomics and Transcriptomics in Lung Cancer Diagnosis: A Pilot Study.
Dovrou A, Bei E, Sfakianakis S, Marias K, Papanikolaou N, Zervakis M. Diagnostics (Basel). 2023 Feb 15;13(4):738. doi: 10.3390/diagnostics13040738. PMID: 36832225.

Identifying large-scale interaction atlases using probabilistic graphs and external knowledge.
Chanumolu SK, Otu HH. J Clin Transl Sci. 2022 Feb 11;6(1):e27. doi: 10.1017/cts.2022.18. PMID: 35321220.

New resampling method for evaluating stability of clusters.
Gana Dresen IM, Boes T, Huesing J, Neuhaeuser M, Joeckel KH. BMC Bioinformatics. 2008 Jan 24;9:42. doi: 10.1186/1471-2105-9-42. PMID: 18218074.

Mining the modular structure of protein interaction networks.
Berenstein AJ, Piñero J, Furlong LI, Chernomoretz A. PLoS One. 2015 Apr 9;10(4):e0122477. doi: 10.1371/journal.pone.0122477. PMID: 25856434.
