BMC Bioinformatics. 2005 Apr 13;6:97.
doi: 10.1186/1471-2105-6-97.

Many accurate small-discriminatory feature subsets exist in microarray transcript data: biomarker discovery

Leslie R Grate. BMC Bioinformatics.

Abstract

Background: Molecular profiling generates abundance measurements for thousands of gene transcripts in biological samples such as normal and tumor tissues (data points). Given such two-class high-dimensional data, many methods have been proposed for classifying data points into one of the two classes. However, finding very small sets of features able to correctly classify the data is problematic because the underlying mathematical problem is hard. Existing methods can find "small" feature sets, but give no indication of how close these are to the true minimum size. Without fundamental mathematical advances, finding true minimum-size sets will remain elusive, and more importantly for the microarray community there will be no methods for finding them.

Results: We use the brute force approach of exhaustive search through all genes, gene pairs (and for some data sets gene triples). Each unique gene combination is analyzed with a few-parameter linear-hyperplane classification method looking for those combinations that form training error-free classifiers. All 10 published data sets studied are found to contain predictive small feature sets. Four contain thousands of gene pairs and 6 have single genes that perfectly discriminate.

Conclusion: This technique discovered small sets of genes (3 or fewer) in published data that form accurate classifiers, yet were not reported in the prior publications. This could be a common characteristic of microarray data, which would make the search for such sets worth the computational cost. Such small gene sets could indicate biomarkers and portend simple medical diagnostic tests. We recommend checking for small gene sets routinely. We find 4 gene pairs and many gene triples in the large hepatocellular carcinoma (HCC, liver cancer) data set of Chen et al. The key component of these is the "placental gene of unknown function", PLAC8. Our HMM modeling indicates PLAC8 might have a domain like part of the 1P59 crystal structure (a non-covalent endonuclease III-DNA complex). The previously identified HCC biomarker gene, glypican 3 (GPC3), is part of an accurate gene triple involving MT1E and ARHE. We also find small gene sets that distinguish leukemia subtypes in the large pediatric acute lymphoblastic leukemia cancer set of Yeoh et al.
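
The exhaustive search described in the Results can be sketched in a few lines of Python. This is an illustrative sketch, not the author's code: the abstract does not specify the "few-parameter linear-hyperplane classification method", so the sketch assumes a simple stand-in that projects each sample onto the difference of the class centroids and places the plane halfway between the closest samples of the two classes (the placement described for Figure 1). The function names `fit_plane` and `error_free_subsets` and the toy data are hypothetical.

```python
from itertools import combinations


def fit_plane(rows, labels, genes):
    """Fit a hyperplane w.x = b over the chosen genes (assumed stand-in method).

    w is the difference of the two class centroids; b sits halfway between
    the closest projected samples of the two classes. gap > 0 means every
    training sample falls on the correct side of the plane. The gap is
    measured along w and is not normalized by the length of w.
    """
    pos = [[r[g] for g in genes] for r, y in zip(rows, labels) if y == 1]
    neg = [[r[g] for g in genes] for r, y in zip(rows, labels) if y == 0]
    w = [sum(p[i] for p in pos) / len(pos) - sum(n[i] for n in neg) / len(neg)
         for i in range(len(genes))]

    def project(x):
        return sum(wi * xi for wi, xi in zip(w, x))

    lowest_pos = min(project(p) for p in pos)
    highest_neg = max(project(n) for n in neg)
    gap = lowest_pos - highest_neg
    b = (lowest_pos + highest_neg) / 2
    return w, b, gap


def error_free_subsets(rows, labels, max_size=2):
    """Try every gene subset up to max_size; keep those with no training errors."""
    hits = []
    for size in range(1, max_size + 1):
        for genes in combinations(range(len(rows[0])), size):
            _, _, gap = fit_plane(rows, labels, genes)
            if gap > 0:
                hits.append((genes, gap))
    return hits


# Toy data (hypothetical): 4 samples x 3 genes; label 0 = normal, 1 = tumor.
rows = [[1.0, 0.0, 3.0], [2.0, 3.0, 0.0], [8.0, 2.0, 4.0], [9.0, 4.0, 2.0]]
labels = [0, 0, 1, 1]
hits = error_free_subsets(rows, labels)
for genes, gap in hits:
    print(genes, gap)
```

In the toy data, gene 0 separates the classes on its own, while genes 1 and 2 each overlap between classes but separate them as a pair, the situation shown in Figure 6. The number of pairs among n genes is n(n-1)/2, so 10,000 genes would require about 50 million pair fits. The centroid-difference stand-in can also miss subsets that are linearly separable along some other direction, so a real search would likely need a stronger linear method.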

Figures

Figure 1
Example plot using two hypothetical genes. Each data point is labeled with the class, and the separating plane is computed to be positioned halfway between the two classes. In this example there is a large separation between the two classes and perfect separation is achieved and no data point is close to the plane.
Figure 2
Liver cancer pair PLAC8 versus BCAT2. The two misclassified samples 108 and 109 are shown as squares. There are 3 other genes that form such pairs with PLAC8.
Figure 3
Liver cancer 3D plot of MT1E, ARHE and GPC3. These 3 genes form a perfect classifier although the margin is small. Red are cancer samples. The web site contains an interactive plot.
Figure 4
1P59 crystal structure, shown with the alignment hit to the possible liver cancer biomarker PLAC8 highlighted in strands at the top. Alignment generated from PFAM model pfam04749.5 DUF614 using the SAM HMM system and displayed in RASMOL.
Figure 5
From the YeohALL data, T vs the rest, the best single gene CD3D. This gene perfectly separates the classes. Plus signs are T subtype samples.
Figure 6
From the YeohALL data, T vs the rest, the best pair HLA-DRA and HUMTCBYY. Each gene alone provides some classification power, but linearly combined they form a perfect classifier, albeit with a small margin. Plus signs are the T subtype samples.
Figure 7
From the YeohALL data, E2A vs the rest, the best single gene PBX1. This gene perfectly separates the classes with a wide margin and has higher values in E2A. Plus signs are E2A subtype samples.
Figure 8
From the YeohALL data, E2A vs the rest, the best gene pair RPS6 and LRMP. Plus signs are the E2A subtype samples. LRMP by itself is a reasonable indicator of E2A status, but combined with RPS6 it separates the data perfectly.

References

    1. Chen X, Cheung S, So S, Fan S, Barry C, Higgins J, Lai K, Ji J, Dudoit S, Ng I, Van De Rijn M, Botstein D, Brown P. Gene expression patterns in human liver cancers. Mol Biol Cell. 2002;13:1929–1939. doi: 10.1091/mbc.02-02-0023.
    2. Liotta L, Ferrari M, Petricoin E. Clinical proteomics: Written in blood. Nature. 2003;425:905. doi: 10.1038/425905a.
    3. Brown M, Grundy W, Lin D, Cristianini N, Sugnet C, Furey T, Ares M Jr, Haussler D. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc Natl Acad Sci U S A. 2000;97:262–267. doi: 10.1073/pnas.97.1.262.
    4. Moler E, Chow M, Mian I. Analysis of molecular profile data using generative and discriminative methods. Physiological Genomics. 2000;4:109–126.
    5. Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang CH, Angelo M, Ladd C, Reich M, Latulippe E, Mesirov J, Poggio T, Gerald W, Loda M, Lander E, Golub T. Multiclass cancer diagnosis using tumor gene expression signatures. Proc Natl Acad Sci U S A. 2001;98:15149–15154. doi: 10.1073/pnas.211566398.