Comparative Study
BMC Bioinformatics. 2003 Dec 2;4:60.
doi: 10.1186/1471-2105-4-60.

Tumor classification and marker gene prediction by feature selection and fuzzy c-means clustering using microarray data

Junbai Wang et al. BMC Bioinformatics.

Abstract

Background: Using DNA microarrays, we have developed two novel models for tumor classification and target gene prediction. First, gene expression profiles are summarized by optimally selected Self-Organizing Maps (SOMs), followed by tumor sample classification by Fuzzy C-means clustering. Then, the prediction of marker genes is accomplished by either manual feature selection (visualizing the weighted/mean SOM component plane) or automatic feature selection (by pair-wise Fisher's linear discriminant).

Results: The proposed models were tested on four published datasets: (1) leukemia, (2) colon cancer, (3) brain tumors, and (4) NCI cancer cell lines. The models gave class predictions with markedly reduced error rates compared to other class prediction approaches, and the importance of feature selection in microarray data analysis was also emphasized.

Conclusions: Our models identify marker genes with predictive potential, often better than other available methods in the literature. The models are potentially useful for medical diagnostics and may reveal some insights into cancer classification. Additionally, we illustrated two limitations in tumor classification from microarray data related to the biology underlying the data, in terms of (1) the class size of the data, and (2) the internal structure of classes. These limitations are not specific to the classification models used.
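
The fuzzy c-means step named in the abstract can be illustrated with a minimal sketch. This is the standard textbook formulation of fuzzy c-means, not the authors' implementation: the paper clusters SOM reference vectors, whereas this example runs directly on a small made-up dataset. The function name fuzzy_c_means, the fuzzifier m = 2, the tolerance, the seed, and the toy samples are all assumptions chosen for illustration.

```python
import random


def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Textbook fuzzy c-means; returns (centers, memberships).

    Hypothetical helper for illustration only. `m` is the fuzzifier (> 1).
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])

    # Random initial memberships; each sample's row is normalized to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        u.append([v / total for v in row])

    centers = []
    for _ in range(n_iter):
        # Centers are membership-weighted means with weights u_ik ** m.
        centers = []
        for k in range(n_clusters):
            w = [u[i][k] ** m for i in range(n)]
            w_sum = sum(w)
            centers.append(
                [sum(w[i] * points[i][d] for i in range(n)) / w_sum for d in range(dim)]
            )

        # Memberships are proportional to (squared distance) ** (-1 / (m - 1)).
        max_change = 0.0
        for i in range(n):
            d2 = [sum((points[i][d] - c[d]) ** 2 for d in range(dim)) for c in centers]
            if min(d2) == 0.0:
                # Sample sits exactly on a center: assign it crisply.
                hit = d2.index(0.0)
                new_row = [1.0 if k == hit else 0.0 for k in range(n_clusters)]
            else:
                inv = [v ** (-1.0 / (m - 1.0)) for v in d2]
                s = sum(inv)
                new_row = [v / s for v in inv]
            max_change = max(max_change, max(abs(a - b) for a, b in zip(new_row, u[i])))
            u[i] = new_row

        # Stop once no membership moved by more than `tol`.
        if max_change < tol:
            break
    return centers, u


# Toy data (made up): two well-separated groups of three samples each.
samples = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
centers, memberships = fuzzy_c_means(samples, n_clusters=2)

# Harden the fuzzy memberships into class labels by taking the largest one.
labels = [row.index(max(row)) for row in memberships]
```

Each row of memberships sums to 1, so a sample lying between two groups receives intermediate values rather than a forced hard assignment. That graded membership is the property that distinguishes fuzzy c-means from k-means.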

Figures

Figure 1
Stress as a function of the number of SOM reference vectors in model one. a) Leukemia data set. b) Colon data set. c) Brain tumor data set. d) NCI60 cancer cell line data set. In each plot, the optimal number of SOM reference vectors was marked by a red vertical line, and that number was indicated in red text.
Figure 2
Weighted/mean SOM component plane. a) Weighted component planes of ALL and AML type of tumors in leukemia data set. b) Mean component planes of Normal and Tumor colon tissues in colon data set. c) Weighted component planes of MD, Mglio, Rhab, Ncer and PNET type of tumors in brain tumor data set. d) Weighted component planes of CNS, Renal, Breast, NSCLC, Ovarian, Leukemia, Colon and Melanoma type of cancer cell lines in NCI60 data set. In each plot, feature map units that were identified by the manual feature selection of model one were marked by light green squares; detailed information on the selected SOM map units can be found in our web supplement [22]. The color scale of the weighted/mean component plane represents the expression level of SOM reference vectors, where red indicates high expression and green indicates low expression.
Figure 3
Empirical cumulative distribution of the significant scores dE. a) Leukemia data set. b) Colon data set. c) Brain tumor data set. d) NCI60 cancer cell line data set. In each plot, the percentage of F(dE) that maximizes the classification performance was marked by a smooth red line.
Figure 4
Test set error as a function of mean class size of the data set.
Figure 5
Diagrams of the two proposed classifier models. a) Model one, with manual feature selection. b) Model two, with automatic feature selection.

References

    1. Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, Powell JI, Yang L, Marti GE, Moore T, Hudson J, Jr, Lu L, Lewis DB, Tibshirani R, Sherlock G, Chan WC, Greiner TC, Weisenburger DD, Armitage JO, Warnke R, Staudt LM. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature. 2000;403:503–511.
    2. Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci USA. 1999;96:6745–6750. doi: 10.1073/pnas.96.12.6745.
    3. Bezdek JC, Pal SK. Fuzzy models for pattern recognition: methods that search for structures in data. New York: IEEE Press; 1992.
    4. Ben-Dor A, Bruhn L, Friedman N, Nachman I, Washington U. Tissue classification with gene expression profiles. RECOMB Tokyo Japan. 2000.
    5. Dettling M, Buhlmann P. Supervised clustering of genes. Genome Biol. 2002;3:12. doi: 10.1186/gb-2002-3-12-research0069.
