On the selection of appropriate distances for gene expression data clustering

Pablo A Jaskowiak, Ricardo J G B Campello, Ivan G Costa

PMID: 24564555
PMCID: PMC4072854
DOI: 10.1186/1471-2105-15-S2-S2

BMC Bioinformatics. 2014;15 Suppl 2(Suppl 2):S2. Epub 2014 Jan 24.

Abstract

Background: Clustering is crucial for gene expression data analysis. As an unsupervised exploratory procedure, its results can help researchers gain insights and formulate new hypotheses about biological data from microarrays. Across the different settings of microarray experiments, clustering proves to be a versatile exploratory tool. It can help to unveil new cancer subtypes or to identify groups of genes that respond similarly to a specific experimental condition. To obtain useful clustering results, however, the parameters of the clustering procedure must be properly tuned. Besides the selection of the clustering method itself, choosing the distance to be employed between data objects is probably one of the most difficult decisions.

Results and conclusions: We analyze how different distances and clustering methods interact regarding their ability to cluster gene expression (i.e., microarray) data. We study 15 distances along with four common clustering methods from the literature on a total of 52 gene expression microarray datasets. Distances are evaluated in several scenarios, including clustering of cancer tissues and clustering of genes from short time-series expression data, the two main clustering applications in gene expression. Our results indicate that the selection of an appropriate distance depends on the scenario at hand. Moreover, in each scenario, given the very same clustering method, significant differences in quality may arise from the selection of distinct distance measures. In fact, the selection of an appropriate distance measure can make the difference between meaningful and poor clustering outcomes, even for a suitable clustering method.
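To illustrate why the choice of distance can change a clustering outcome, the following minimal sketch compares three common measures (Pearson-based, cosine, and Euclidean distance) on made-up expression profiles. The profiles, the function names, and the `nearest` helper are assumptions introduced here for illustration; they are not the datasets, the full set of 15 distances, or the evaluation protocol used in the study.

```python
import math


def pearson_distance(x, y):
    """1 - Pearson correlation: close to 0 when profiles share the same shape."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


def cosine_distance(x, y):
    """1 - cosine similarity: ignores vector length but not the mean level."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (nx * ny)


def euclidean_distance(x, y):
    """Straight-line distance: sensitive to differences in magnitude."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def nearest(query, candidates, dist):
    """Return the name of the candidate profile closest to `query` (hypothetical helper)."""
    return min(candidates, key=lambda name: dist(query, candidates[name]))


# Hypothetical expression profiles over five conditions.
gene_a = [1.0, 2.0, 3.0, 2.0, 1.0]
candidates = {
    "gene_b": [10.0, 20.0, 30.0, 20.0, 10.0],  # same shape as gene_a, 10x amplitude
    "gene_c": [3.0, 2.0, 1.0, 2.0, 3.0],       # opposite trend, similar magnitude
}

for name, dist in [("pearson", pearson_distance),
                   ("cosine", cosine_distance),
                   ("euclidean", euclidean_distance)]:
    print(name, "-> nearest to gene_a:", nearest(gene_a, candidates, dist))
```

In this toy case the correlation-based and cosine distances pair `gene_a` with the co-regulated `gene_b`, whereas Euclidean distance pairs it with the anti-correlated `gene_c`. Any clustering method built on these distances would therefore start from different neighborhoods.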

Figures

**Figure 1**
**Cancer Datasets Results: Class recovery obtained for cancer datasets regarding the three evaluation scenarios under consideration, subfigures (a), (b), and (c)**. Bars display mean results for each pair of clustering method and distance function in different types of datasets: cDNA (left) and Affymetrix (right).

**Figure 2**
**Robustness to Noise for Cancer Datasets: ARI values for different noise levels (%) regarding PE, JK, SP, RM, COS and EUC**. Plots correspond to the mean ARI values for runs performed on 100 different noisy datasets with the same amount (%) of noise points. Bars indicate standard deviations.

**Figure 3**
**Gene Time-Series Results: Results for gene time-series data**. Figures (a), (b) and (c) depict pairwise comparisons of distances for each clustering method. Figure (d) depicts an all-against-all pairwise comparison. Each cell accounts for the number of datasets in which the method from the row obtained a better enrichment than the method from the column. The "hotter"/"colder" the cell, the better/worse the row method is in comparison to the column one.
