Are clusters found in one dataset present in another dataset?

Amy V Kapp¹, Robert Tibshirani


PMID: 16613834
DOI: 10.1093/biostatistics/kxj029

Comparative Study


Amy V Kapp et al. Biostatistics. 2007 Jan;8(1):9-31. doi: 10.1093/biostatistics/kxj029. Epub 2006 Apr 12.


Affiliation

¹ Department of Statistics, Stanford University, Stanford, CA 94305-4065, USA. akapp@stanford.edu


Abstract

In many microarray studies, a cluster defined on one dataset is sought in an independent dataset. If the cluster is found in the new dataset, the cluster is said to be "reproducible" and may be biologically significant. Classifying a new datum to a previously defined cluster can be seen as predicting which of the previously defined clusters is most similar to the new datum. If the new data classified to a cluster are similar, molecularly or clinically, to the data already present in the cluster, then the cluster is reproducible and the corresponding prediction accuracy is high. Here, we take advantage of the connection between reproducibility and prediction accuracy to develop a validation procedure for clusters found in datasets independent of the one in which they were characterized. We define a cluster quality measure called the "in-group proportion" (IGP) and introduce a general procedure for individually validating clusters. Using simulations and real breast cancer datasets, the IGP is compared to four other popular cluster quality measures (homogeneity score, separation score, silhouette width, and weighted average discrepant pairs score). Moreover, simulations and the real breast cancer datasets are used to compare four versions of the validation procedure that all use the IGP but differ in the way the null distributions are generated. We find that the IGP is the best measure of prediction accuracy, and that one version of the validation procedure is more widely applicable than the other three. An implementation of this algorithm is in a package called "clusterRepro" available through The Comprehensive R Archive Network (http://cran.r-project.org).
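
The abstract names the IGP without defining it. The sketch below is one plausible reading, not the authors' code and not the clusterRepro API. It assumes that new samples are assigned to the most correlated cluster centroid, and that the IGP of a cluster is the fraction of its members whose nearest neighbor (here, the most Pearson-correlated other sample) carries the same cluster label. The function names, the choice of Pearson correlation, and the toy data are all illustrative assumptions.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def assign_to_centroids(samples, centroids):
    """Label each sample with the index of its most correlated centroid."""
    return [
        max(range(len(centroids)), key=lambda k: pearson(s, centroids[k]))
        for s in samples
    ]


def in_group_proportion(samples, labels):
    """Per cluster: share of members whose nearest neighbor has the same label.

    Assumed definition (hypothetical helper, not taken from the abstract).
    """
    n = len(samples)
    sizes, hits = {}, {}
    for i in range(n):
        # Nearest neighbor = the other sample with the highest correlation.
        nn = max(
            (j for j in range(n) if j != i),
            key=lambda j: pearson(samples[i], samples[j]),
        )
        sizes[labels[i]] = sizes.get(labels[i], 0) + 1
        if labels[nn] == labels[i]:
            hits[labels[i]] = hits.get(labels[i], 0) + 1
    return {u: hits.get(u, 0) / sizes[u] for u in sizes}


# Toy data: two centroids from a "training" set, five "new" samples.
centroids = [[1, 2, 3, 4], [4, 3, 2, 1]]
new_samples = [
    [1, 2, 3, 5],
    [2, 3, 4, 5],
    [1, 2, 4, 4],
    [5, 3, 2, 1],
    [4, 4, 2, 1],
]
labels = assign_to_centroids(new_samples, centroids)
igp = in_group_proportion(new_samples, labels)
print(labels, igp)
```

An IGP near 1 for a cluster would suggest that the samples classified to it also sit close to one another in the new data. The validation procedure described in the abstract would then compare such a value against a null distribution, whose construction the abstract does not specify and which is therefore not sketched here.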


Grants and funding

N01-HV-28183/HV/NHLBI NIH HHS/United States

