Are clusters found in one dataset present in another dataset?
- PMID: 16613834
- DOI: 10.1093/biostatistics/kxj029
Abstract
In many microarray studies, a cluster defined on one dataset is sought in an independent dataset. If the cluster is found in the new dataset, the cluster is said to be "reproducible" and may be biologically significant. Classifying a new datum to a previously defined cluster can be seen as predicting which of the previously defined clusters is most similar to the new datum. If the new data classified to a cluster are similar, molecularly or clinically, to the data already present in the cluster, then the cluster is reproducible and the corresponding prediction accuracy is high. Here, we take advantage of the connection between reproducibility and prediction accuracy to develop a validation procedure for clusters found in datasets independent of the one in which they were characterized. We define a cluster quality measure called the "in-group proportion" (IGP) and introduce a general procedure for individually validating clusters. Using simulations and real breast cancer datasets, the IGP is compared to four other popular cluster quality measures (homogeneity score, separation score, silhouette width, and weighted average discrepant pairs score). Moreover, simulations and the real breast cancer datasets are used to compare four versions of the validation procedure, all of which use the IGP but differ in how the null distributions are generated. We find that the IGP is the best measure of prediction accuracy, and that one version of the validation procedure is more widely applicable than the other three. An implementation of this algorithm is provided in the R package "clusterRepro", available through The Comprehensive R Archive Network (http://cran.r-project.org).
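The abstract does not spell out how the IGP is computed, so the following is only a minimal sketch of the idea, not the clusterRepro implementation. It assumes that new samples are assigned to the previously defined cluster whose centroid they correlate with most strongly (Pearson correlation), and that the IGP of a cluster is the fraction of its samples whose nearest neighbor (again by correlation) was assigned to the same cluster. The function names `assign_to_centroids` and `in_group_proportion` are hypothetical, and the null-distribution step of the validation procedure is omitted.

```python
import numpy as np


def _standardize_rows(M):
    # Center each row and scale it to unit length, so that a dot product
    # between two rows equals their Pearson correlation.
    # Assumption: no row is constant (that would divide by zero).
    Mc = M - M.mean(axis=1, keepdims=True)
    return Mc / np.linalg.norm(Mc, axis=1, keepdims=True)


def assign_to_centroids(X, centroids):
    """Assign each sample (row of X) to the most correlated centroid."""
    corr = _standardize_rows(X) @ _standardize_rows(centroids).T
    return corr.argmax(axis=1)


def in_group_proportion(X, labels):
    """Per-cluster fraction of samples whose nearest neighbor shares their label."""
    Z = _standardize_rows(X)
    corr = Z @ Z.T
    np.fill_diagonal(corr, -np.inf)   # a sample is not its own neighbor
    nearest = corr.argmax(axis=1)     # index of each sample's nearest neighbor
    same = labels[nearest] == labels
    return {int(k): float(same[labels == k].mean()) for k in np.unique(labels)}


# Toy example: two centroids defined on an earlier dataset (4 genes each)
# and six samples from a new, independent dataset.
centroids = np.array([[1.0, 2.0, 3.0, 4.0],
                      [4.0, 3.0, 2.0, 1.0]])
X_new = np.array([[1.0, 2.0, 3.0, 4.5],
                  [1.2, 2.0, 3.0, 4.0],
                  [0.9, 2.1, 3.0, 4.0],
                  [4.0, 3.0, 2.0, 1.1],
                  [4.2, 3.0, 2.0, 1.0],
                  [3.9, 3.1, 2.0, 1.0]])

labels = assign_to_centroids(X_new, centroids)
igp = in_group_proportion(X_new, labels)
print(labels, igp)
```

Under these assumptions, an IGP near 1 for a cluster means the new samples assigned to it also sit close to one another, which is the sense in which the abstract links reproducibility to prediction accuracy. Judging whether an observed IGP is higher than expected by chance would still require one of the null distributions the abstract mentions.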