Exploring the within- and between-class correlation distributions for tumor classification

Xuelian Wei¹, Ker-Chau Li

PMID: 20339085
PMCID: PMC2872377
DOI: 10.1073/pnas.0910140107

Proc Natl Acad Sci U S A. 2010 Apr 13;107(15):6737-42. doi: 10.1073/pnas.0910140107. Epub 2010 Mar 25.

Affiliation

¹ Department of Statistics, University of California, 8125 Math Sciences Building, Box 951554, Los Angeles, CA 90095-1554, USA.

Abstract

To many biomedical researchers, effective tumor classification methods such as the support vector machine often appear to be a black box, not only because the procedures are complex but also because the required specifications, such as the choice of a kernel function, lack clear guidance, either mathematical or biological. As is commonly observed, samples within the same tumor class tend to be more similar in gene expression than samples from different tumor classes. But can this well-accepted observation lead to a useful procedure for classification and prediction? To address this question, we first developed a statistical framework and derived general conditions that serve as a theoretical foundation for this empirical observation. We then constructed a classification procedure that fully uses the information obtained by comparing the distributions of within-class correlations with those of between-class correlations via Kullback-Leibler divergence. We compared our approach with many machine-learning techniques by applying them to 22 binary and multiclass gene-expression datasets involving human cancers. The results showed that our method performed as effectively as the support vector machine and naïve Bayes classifiers and outperformed other learning methods (decision trees, linear discriminant analysis, and k-nearest neighbor). In addition, we conducted a simulation study and showed that our method would be more effective when newly arriving samples are subject to the often-encountered problems of baseline shift or increased noise level. Our method can be extended to general classification problems in which only the similarity scores between samples are available.
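
The central comparison in the abstract, within-class versus between-class correlation distributions measured by Kullback-Leibler divergence, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the synthetic two-class data, the helper names `correlation_groups` and `kl_divergence`, the histogram estimate of each distribution, and the pseudo-count smoothing.

```python
import numpy as np

def correlation_groups(X, y):
    """Split pairwise Pearson correlations between samples into
    within-class and between-class sets.
    X: (n_samples, n_genes) expression matrix; y: (n_samples,) class labels."""
    R = np.corrcoef(X)                      # sample-by-sample correlation matrix
    iu = np.triu_indices(len(y), k=1)       # each unordered pair once, no diagonal
    same = (y[:, None] == y[None, :])[iu]   # True where both samples share a class
    r = R[iu]
    return r[same], r[~same]

def kl_divergence(p_samples, q_samples, bins=20, pseudo=0.5):
    """Histogram-based estimate of KL(P || Q) on [-1, 1].
    The pseudo-count (an assumption of this sketch) keeps empty bins finite."""
    edges = np.linspace(-1.0, 1.0, bins + 1)
    p, _ = np.histogram(p_samples, bins=edges)
    q, _ = np.histogram(q_samples, bins=edges)
    p = (p + pseudo) / (p + pseudo).sum()
    q = (q + pseudo) / (q + pseudo).sum()
    return float(np.sum(p * np.log(p / q)))

# Synthetic data (hypothetical): each class has its own mean expression
# profile, and every sample is that profile plus independent noise.
rng = np.random.default_rng(0)
n_genes = 200
profiles = rng.normal(size=(2, n_genes))
y = np.repeat([0, 1], 15)
X = profiles[y] + rng.normal(scale=1.0, size=(len(y), n_genes))

within, between = correlation_groups(X, y)
kl = kl_divergence(within, between)
print(f"mean within-class r  = {within.mean():.3f}")
print(f"mean between-class r = {between.mean():.3f}")
print(f"KL(within || between) = {kl:.3f}")
```

On data generated this way the within-class correlations sit well above the between-class ones, so the divergence is clearly positive. On real expression data the separation depends on gene selection and preprocessing, which this sketch does not address.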

Conflict of interest statement

The authors declare no conflict of interest.

Figures

**Fig. 1.**
Box plot for accuracy rates from 100 3-fold cross-validation tests using disease-related genes alone for different methods.

**Fig. 2.**
A cartoon example illustrating the flow chart of DBC. Step I (*Top*): feature detection by estimating the reference similarity distribution matrix {f_ij}_K×K. Step II (*Middle*): prediction based on K parallel hypothesis tests. Step III (*Bottom*): decision rule based on the results of the above K parallel hypothesis tests, with a weighted KL decision rule applied to the “unclassified” samples.
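
The three steps named in the Fig. 2 caption can be loosely sketched in Python. This is a heavily simplified stand-in, not the DBC procedure itself: each reference distribution f_ij is summarised by a Gaussian fit (mean and standard deviation), the K parallel hypothesis tests are replaced by a direct minimum-KL rule, and no weighting is applied. The function names `gaussian_kl`, `fit_reference`, and `predict`, and the synthetic data, are all hypothetical.

```python
import numpy as np

def gaussian_kl(m1, s1, m2, s2):
    """Closed-form KL( N(m1, s1^2) || N(m2, s2^2) )."""
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

def fit_reference(X, y):
    """Step I (simplified): summarise each f_ij by the mean and std of the
    correlations between training samples of class i and class j."""
    classes = np.unique(y)
    R = np.corrcoef(X)
    ref = {}
    for i in classes:
        for j in classes:
            block = R[np.ix_(y == i, y == j)]
            if i == j:
                # drop the diagonal and duplicate pairs inside a class
                vals = block[np.triu_indices(block.shape[0], k=1)]
            else:
                vals = block.ravel()
            ref[(i, j)] = (vals.mean(), vals.std() + 1e-6)
    return classes, ref

def predict(x_new, X, y, classes, ref):
    """Steps II-III (simplified): for each candidate class k, sum over j the
    KL divergence between the new sample's correlations with class j and the
    reference f_kj, then return the class with the smallest total."""
    r = np.array([np.corrcoef(x_new, row)[0, 1] for row in X])
    scores = []
    for k in classes:
        total = 0.0
        for j in classes:
            g = r[y == j]
            total += gaussian_kl(g.mean(), g.std() + 1e-6, *ref[(k, j)])
        scores.append(total)
    return classes[int(np.argmin(scores))]

# Synthetic three-class example (hypothetical data).
rng = np.random.default_rng(1)
n_genes = 300
profiles = rng.normal(size=(3, n_genes))
y_train = np.repeat([0, 1, 2], 10)
X_train = profiles[y_train] + rng.normal(size=(len(y_train), n_genes))

classes, ref = fit_reference(X_train, y_train)
x_new = profiles[2] + rng.normal(size=n_genes)   # a new sample drawn from class 2
predicted = predict(x_new, X_train, y_train, classes, ref)
print("predicted class:", predicted)
```

Because the rule needs only sample-to-sample correlations, the same structure would apply to any problem where a similarity score between samples is available, which is the extension the abstract mentions.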
