PLoS One. 2011;6(9):e24259.
doi: 10.1371/journal.pone.0024259. Epub 2011 Sep 1.

Identification of single- and multiple-class specific signature genes from gene expression profiles by group marker index

Yu-Shuen Tsai et al. PLoS One. 2011.

Abstract

Informative genes from microarray data can be used to construct prediction models and to investigate biological mechanisms. Differentially expressed genes, the main targets of most gene selection methods, can be classified as single-class or multiple-class specific signature genes. Here, we present a novel gene selection algorithm based on a Group Marker Index (GMI), which is intuitive, has low computational complexity, and efficiently identifies both types of genes. Most gene selection methods identify only single-class specific signature genes and cannot easily identify multiple-class specific signature genes. Our algorithm can detect de novo certain conditions of multiple-class specificity of a gene, and it uses a novel non-parametric indicator to assess the discrimination ability between classes. Our method is effective even when the sample size is small and when the class sizes differ significantly. To compare effectiveness and robustness, we formulate an intuitive template-based method and use four well-known datasets. We demonstrate that our algorithm outperforms the template-based method in difficult cases with unbalanced class distributions. Moreover, the multiple-class specific genes are good biomarkers and play important roles in biological pathways. Our literature survey supports the finding that the proposed method identifies unique multiple-class specific marker genes (not previously reported to be related to cancer) in the Central Nervous System data. It also discovers unique biomarkers that indicate the intrinsic difference between subtypes of lung cancer. We also associate pathway information with the multiple-class specific signature genes and cross-reference it with published studies. We find that the identified genes participate in pathways directly involved in cancer development in the leukemia data.
Our method offers a promising way to find genes that may be involved in pathways of multiple diseases. It therefore opens up the possibility of using an existing drug for other diseases, as well as of designing a single drug for multiple diseases.
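
The abstract names a template-based method as the baseline but does not define it. The sketch below shows one common way such a baseline is built and is not taken from the paper: for a chosen level k, every subset of k classes defines a binary template (1 for samples in the subset, 0 otherwise), and a gene is scored by Pearson's correlation between its expression values and each template. The function names `template_scores` and `best_template`, the use of a signed correlation, and the toy data are all assumptions made for illustration.

```python
from itertools import combinations

import numpy as np


def template_scores(expr, labels, level):
    """Correlate one gene with every binary template of `level` classes.

    Illustrative assumption, not the paper's definition: a template is 1
    for samples whose class is in the chosen subset and 0 elsewhere.
    """
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)
    classes = sorted(set(labels.tolist()))
    # A template covering no class or every class is constant, so
    # Pearson's correlation would be undefined for it.
    if not 1 <= level < len(classes):
        raise ValueError("level must be between 1 and n_classes - 1")
    scores = {}
    for subset in combinations(classes, level):
        template = np.isin(labels, subset).astype(float)
        scores[subset] = float(np.corrcoef(expr, template)[0, 1])
    return scores


def best_template(expr, labels, level):
    """Return the class subset in which the gene looks most up-regulated."""
    scores = template_scores(expr, labels, level)
    subset = max(scores, key=scores.get)  # largest signed correlation
    return subset, scores[subset]


# Toy data: a gene that is high in two of four classes (a level-2 pattern).
labels = ["EWS", "EWS", "BL", "BL", "NB", "NB", "RMS", "RMS"]
expr = [5.1, 4.9, 5.0, 5.2, 1.0, 1.1, 0.9, 1.2]
subset, r = best_template(expr, labels, level=2)
print(subset, round(r, 3))
```

Because the score is a correlation coefficient, it depends on how many samples fall inside and outside the template, which is one plausible reason a baseline of this kind could struggle with small or unbalanced classes.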

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Scatter plots of the top-ranked gene of each level in the SRBCT data set.
Panels (a), (b) and (c) are the scatter plots of the top-ranked gene of level-1, level-2, and level-3, respectively. The top-ranked genes are WAS (236282), PTPN12 (774502) and GSTA4 (504791), respectively. There are four classes in the SRBCT data set: Ewing sarcomas (EWS), Burkitt lymphomas (BL), neuroblastomas (NB), and rhabdomyosarcomas (RMS).
Figure 2. The level-2-like and level-1-like genes ranked within the top 10 level-3 genes by the template-based method in the Lung Cancer data set.
Panels (a), (b), (c) and (d) are the scatter plots of the level-2-like genes. Panels (e) and (f) are the scatter plots of the level-1-like genes.
Figure 3. Effect of sample size on Pearson's correlation coefficient values.
Figure 4. Steps involved in computing GMI and finding the list of group-specific genes for each level of discrimination.
Figure 5. A 5-class synthetic example to illustrate the computation of GMI.
There are four levels of discrimination in the 5-class synthetic data set. Panels (a) to (d) depict the computation of GMI values at each level of discrimination. The dotted lines in each panel indicate the two mean values used for GMI computation at that level of discrimination. Filled samples in each panel are the upper group samples; the remaining open samples are the lower group samples.
