2010 Apr 13;107(15):6737-42.
doi: 10.1073/pnas.0910140107. Epub 2010 Mar 25.

Exploring the within- and between-class correlation distributions for tumor classification


Xuelian Wei et al. Proc Natl Acad Sci U S A.

Abstract

To many biomedical researchers, effective tumor classification methods such as the support vector machine often appear to be a black box, not only because the procedures are complex but also because the required specifications, such as the choice of a kernel function, lack clear guidance, either mathematical or biological. As is commonly observed, samples within the same tumor class tend to be more similar in gene expression than samples from different tumor classes. But can this widely accepted observation lead to a useful procedure for classification and prediction? To address this question, we first formulated a statistical framework and derived general conditions that serve as the theoretical foundation for this empirical observation. We then constructed a classification procedure that fully uses the information obtained by comparing the distribution of within-class correlations with that of between-class correlations via Kullback-Leibler divergence. We compared our approach with many machine-learning techniques by applying them to 22 binary and multiclass gene-expression datasets involving human cancers. The results showed that our method performed as efficiently as the support vector machine and naïve Bayes and outperformed other learning methods (decision trees, linear discriminant analysis, and k-nearest neighbor). In addition, we conducted a simulation study and showed that our method would be more effective when newly arriving samples are subject to the often-encountered problems of baseline shift or increased noise level. Our method can be extended to general classification problems in which only the similarity scores between samples are available.
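The central comparison in the abstract, within-class versus between-class correlation distributions measured by Kullback-Leibler divergence, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the synthetic two-class data, the use of Pearson correlation as the similarity score, the 20-bin smoothed histogram as the density estimate, and the helper names `pearson`, `histogram`, `kl_divergence`, and `toy_data`.

```python
import math
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def histogram(values, bins=20, pseudo=1.0):
    """Histogram of correlations on [-1, 1] with pseudocounts, as probabilities."""
    counts = [pseudo] * bins
    for v in values:
        idx = int((v + 1.0) / 2.0 * bins)
        counts[max(0, min(idx, bins - 1))] += 1
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Discrete Kullback-Leibler divergence KL(p || q), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def toy_data(seed=0, genes=50, per_class=12):
    """Hypothetical data: each class is a random mean profile plus Gaussian noise."""
    rng = random.Random(seed)
    classes = {}
    for label in ("A", "B"):
        centre = [rng.gauss(0, 1) for _ in range(genes)]
        classes[label] = [
            [c + rng.gauss(0, 0.5) for c in centre] for _ in range(per_class)
        ]
    return classes


classes = toy_data()

# Correlations between pairs of samples that share a class label.
within = [
    pearson(x, y)
    for samples in classes.values()
    for x, y in combinations(samples, 2)
]
# Correlations between pairs of samples drawn from different classes.
between = [pearson(x, y) for x in classes["A"] for y in classes["B"]]

mean_within = sum(within) / len(within)
mean_between = sum(between) / len(between)
kl = kl_divergence(histogram(within), histogram(between))

print(f"mean within-class correlation:  {mean_within:.3f}")
print(f"mean between-class correlation: {mean_between:.3f}")
print(f"KL(within || between):          {kl:.3f}")
```

On data of this kind the within-class correlations should sit well above the between-class ones, and the divergence between the two histograms gives a single number for how separable the classes are by similarity alone. The pseudocounts keep the divergence finite when a bin is empty in one histogram.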

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Fig. 1.
Box plot for accuracy rates from 100 3-fold cross-validation tests using disease-related genes alone for different methods.
Fig. 2.
A cartoon example illustrating the flow chart of DBC. Step I (Top): feature detection by estimating the reference similarity distribution matrix {f_ij}_{K×K}. Step II (Middle): prediction based on K parallel hypothesis tests. Step III (Bottom): decision rule based on the results of the above K parallel hypothesis tests, with a weighted KL decision rule applied to the remaining “unclassified” samples.
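The three steps in the Fig. 2 caption can be made concrete with a rough sketch. It is a simplified stand-in, not the paper's DBC procedure: the K parallel hypothesis tests and the weighted KL rule are replaced here by a single unweighted rule that assigns a new sample to the class hypothesis with the smallest summed KL divergence. The synthetic three-class data, the histogram density estimate, and the names `reference_matrix` and `classify` are all hypothetical.

```python
import math
import random
from itertools import combinations

BINS = 20  # histogram resolution on [-1, 1]; an arbitrary choice for this sketch


def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def hist(values, pseudo=0.5):
    """Histogram of correlations with pseudocounts, as probabilities."""
    counts = [pseudo] * BINS
    for v in values:
        idx = int((v + 1.0) / 2.0 * BINS)
        counts[max(0, min(idx, BINS - 1))] += 1
    total = sum(counts)
    return [c / total for c in counts]


def kl(p, q):
    """Discrete KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def reference_matrix(train):
    """Step I analogue: f[i][j] estimates the distribution of correlations
    between a sample of class i and a sample of class j."""
    f = {}
    for i, xi in train.items():
        f[i] = {}
        for j, xj in train.items():
            if i == j:
                corrs = [pearson(a, b) for a, b in combinations(xi, 2)]
            else:
                corrs = [pearson(a, b) for a in xi for b in xj]
            f[i][j] = hist(corrs)
    return f


def classify(sample, train, f):
    """Simplified Steps II-III: under the hypothesis 'sample is in class i',
    its correlations with class j should look like f[i][j]; pick the
    hypothesis with the smallest total divergence (assumed rule)."""
    observed = {
        j: hist([pearson(sample, b) for b in xj]) for j, xj in train.items()
    }
    scores = {
        i: sum(kl(observed[j], f[i][j]) for j in train) for i in train
    }
    return min(scores, key=scores.get), scores


# Hypothetical data: three classes, each a random mean profile plus noise.
rng = random.Random(1)
centres = {c: [rng.gauss(0, 1) for _ in range(60)] for c in ("A", "B", "C")}
train = {
    c: [[m + rng.gauss(0, 0.5) for m in centre] for _ in range(10)]
    for c, centre in centres.items()
}
f = reference_matrix(train)

# One fresh sample per class, classified from similarity scores alone.
predictions = {}
for label, centre in centres.items():
    new_sample = [m + rng.gauss(0, 0.5) for m in centre]
    predictions[label], _ = classify(new_sample, train, f)
print(predictions)
```

The classifier touches a new sample only through its correlations with the training samples, which is in the spirit of the abstract's remark that the approach extends to problems where only similarity scores are available.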
