Class prediction and discovery using gene microarray and proteomics mass spectroscopy data: curses, caveats, cautions

R L Somorjai¹, B Dolenko, R Baumgartner


PMID: 12912828
DOI: 10.1093/bioinformatics/btg182

Comparative Study

Bioinformatics. 2003 Aug 12;19(12):1484-91.

Affiliation

¹ Institute for Biodiagnostics, National Research Council Canada, Winnipeg, MB, Canada R3B 1Y6. Ray.Somorjai@nrc-cnrc.gc.ca

Abstract

Motivation: Two practical realities constrain the analysis of microarray data, mass spectra from proteomics, and biomedical infrared or magnetic resonance spectra. One is the 'curse of dimensionality': the number of features characterizing these data is in the thousands or tens of thousands. The other is the 'curse of dataset sparsity': the number of samples is limited. The consequences of these two curses are far-reaching when such data are used to classify the presence or absence of disease.

Results: Using very simple classifiers, we show for several publicly available microarray and proteomics datasets how these curses influence classification outcomes. In particular, even if the sample-per-feature ratio is increased to the recommended 5-10 by feature extraction/reduction methods, dataset sparsity can render any classification result statistically suspect. In addition, several 'optimal' feature sets are typically identifiable for sparse datasets, all producing perfect classification results, both for the training and independent validation sets. This non-uniqueness leads to interpretational difficulties and casts doubt on the biological relevance of any of these 'optimal' feature sets. We suggest an approach to assess the relative quality of apparently equally good classifiers.
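The non-uniqueness described above can be illustrated with a small simulation. The sketch below is not the authors' method and does not use their datasets: it assumes pure Gaussian noise, 5 samples per class, 2000 features, and a single-threshold classifier per feature, all chosen only for illustration. Under those assumptions it counts how many noise features separate the two classes perfectly on the training samples, and compares that count with the value expected by chance.

```python
import random
from math import comb


def perfectly_separates(values, labels):
    """True if one threshold on this feature splits the two classes with no errors."""
    class0 = [v for v, y in zip(values, labels) if y == 0]
    class1 = [v for v, y in zip(values, labels) if y == 1]
    return max(class0) < min(class1) or max(class1) < min(class0)


def chance_of_perfect_split(n_per_class):
    """Probability that a continuous noise feature separates two equal-sized classes.

    All orderings of the samples are equally likely, and only 2 of the
    C(2n, n) ways of placing the class labels along the ordering are
    perfectly separated (class 0 entirely below class 1, or the reverse).
    """
    return 2 / comb(2 * n_per_class, n_per_class)


def count_perfect_noise_features(n_per_class=5, n_features=2000, seed=0):
    """Count pure-noise features that classify the training samples perfectly."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    labels = [0] * n_per_class + [1] * n_per_class
    count = 0
    for _ in range(n_features):
        # Each feature is independent noise with no relation to the labels.
        values = [rng.gauss(0.0, 1.0) for _ in labels]
        if perfectly_separates(values, labels):
            count += 1
    return count


if __name__ == "__main__":
    n_per_class, n_features = 5, 2000  # illustrative sizes, not from the paper
    expected = n_features * chance_of_perfect_split(n_per_class)
    found = count_perfect_noise_features(n_per_class, n_features)
    print(f"samples per feature: {2 * n_per_class / n_features:.4f}")
    print(f"expected perfect noise features: {expected:.1f}")
    print(f"found perfect noise features:    {found}")
```

With these sizes, roughly 16 of the 2000 noise features are expected to look like perfect single-feature classifiers, even though none carries any class information. This is only the training-set half of the picture; the abstract's stronger point, that several feature sets can also be perfect on an independent validation set, would require extending the sketch with held-out samples.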
