BMC Bioinformatics. 2014 Aug 11;15(1):274.
doi: 10.1186/1471-2105-15-274.

A feature selection method for classification within functional genomics experiments based on the proportional overlapping score

Osama Mahmoud et al. BMC Bioinformatics.

Abstract

Background: Microarray technology, as well as other functional genomics experiments, allows simultaneous measurement of thousands of genes within each sample. Both the prediction accuracy and the interpretability of a classifier can be enhanced by performing classification based only on selected discriminative genes. We propose a statistical method for selecting genes based on an analysis of how their expression values overlap across classes. This method yields a novel measure, the proportional overlapping score (POS), of a feature's relevance to a classification task.

Results: We apply POS, along with four widely used gene selection methods, to several benchmark gene expression datasets. Classification error rates computed using the Random Forest, k Nearest Neighbor and Support Vector Machine classifiers show that POS achieves better performance than the compared methods.

Conclusions: A novel gene selection method, POS, is proposed. POS analyzes the overlap of expression values across classes, taking into account the proportions of overlapping samples. It robustly defines a mask for each gene that allows the method to minimize the effect of expression outliers. The constructed masks, along with a novel gene score, are used to produce the selected subset of genes.
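
The abstract does not state the scoring formula, so the following Python sketch is only an illustration of the idea described above, not the authors' definition. It assumes a two-class problem, an outlier fence of 1.5 times the interquartile range for the core intervals, and a score that multiplies the relative length of the overlap region by the product of the class proportions among overlapping samples. The function names `core_interval` and `overlap_score` are hypothetical.

```python
from statistics import quantiles


def core_interval(values):
    """Range of the values that survive an assumed 1.5 * IQR outlier fence."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    fence = 1.5 * (q3 - q1)
    kept = [v for v in values if q1 - fence <= v <= q3 + fence]
    return min(kept), max(kept)


def overlap_score(class1, class2):
    """Return (score, mask) for one gene measured in two classes.

    Assumed form: 4 * (overlap length / total core length) * p1 * p2,
    where p1 and p2 are the class proportions among overlapping samples.
    mask[k] is 1 when sample k lies in its own class's core interval
    but outside the overlap region, and 0 otherwise.
    """
    a1, b1 = core_interval(class1)
    a2, b2 = core_interval(class2)
    lo, hi = max(a1, a2), min(b1, b2)  # overlap bounds; empty if lo > hi

    def in_overlap(v):
        return lo <= v <= hi

    mask = [int(a1 <= v <= b1 and not in_overlap(v)) for v in class1]
    mask += [int(a2 <= v <= b2 and not in_overlap(v)) for v in class2]

    n1 = sum(in_overlap(v) for v in class1)
    n2 = sum(in_overlap(v) for v in class2)
    if n1 == 0 or n2 == 0:
        return 0.0, mask  # no samples of one class inside the overlap

    total = max(b1, b2) - min(a1, a2)
    length_ratio = (hi - lo) / total if total > 0 else 1.0
    n = n1 + n2
    return 4.0 * length_ratio * (n1 / n) * (n2 / n), mask


# Toy expression values (illustrative only, not data from the paper).
score, mask = overlap_score([1, 2, 3, 4, 5, 6], [5, 6, 7, 8, 9, 10])
```

Under these assumptions a score of 0 means the two classes do not overlap for that gene, and the score grows as the overlap region widens and as the overlapping samples split more evenly between the classes.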

Figures

Figure 1
An example of two genes with different overlapping patterns. Expression values of two different genes (i1, i2), each with 36 samples belonging to 2 classes, 18 samples per class: (a) expression values of gene i1, (b) expression values of gene i2.
Figure 2
Core intervals with gene mask. An example of the core expression intervals of a gene with 18 samples belonging to class 1 (red) and 14 samples belonging to class 2 (green), together with its associated mask elements. Elements of the overlapping samples set and the non‐overlapping samples set are highlighted by squares and circles, respectively.
Figure 3
Illustration of overlapping intervals with different proportions. Examples of expression values of 2 genes distinguishing between 2 classes: (a) gene i1 has its overlapping samples distributed as 1:1, (b) gene i2 has its overlapping samples distributed as 5:1 for class 1:class 2.
Figure 4
Averages of classification error rates for ‘Srbct’ and ‘Breast’ datasets. Average classification error rates for ‘Srbct’ and ‘Breast’ data based on 50 repetitions of 10‐fold CV using the ISIS, Wil‐RS, mRMR, MP and POS methods.
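
As a rough illustration of how an error rate averaged over 50 repetitions of 10‐fold CV can be computed, the sketch below uses only the Python standard library. The 1‐nearest‐neighbour classifier is a stand‐in chosen for brevity, and the function names are hypothetical; none of this is taken from the paper's code. In a real gene selection study the selection step would normally be repeated inside each training fold to avoid selection bias, which is omitted here.

```python
import random


def nearest_neighbor_predict(train_x, train_y, x):
    """Label of the training point closest to x (squared Euclidean distance)."""
    best = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)),
    )
    return train_y[best]


def repeated_cv_error(xs, ys, folds=10, repetitions=50, seed=0):
    """Mean misclassification rate over repeated, reshuffled k-fold splits."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    rates = []
    for _ in range(repetitions):
        rng.shuffle(indices)  # new random partition for each repetition
        errors = 0
        for f in range(folds):
            test_idx = indices[f::folds]  # every folds-th index forms one fold
            held_out = set(test_idx)
            train_idx = [i for i in indices if i not in held_out]
            train_x = [xs[i] for i in train_idx]
            train_y = [ys[i] for i in train_idx]
            errors += sum(
                nearest_neighbor_predict(train_x, train_y, xs[i]) != ys[i]
                for i in test_idx
            )
        rates.append(errors / len(xs))
    return sum(rates) / len(rates)


# Toy data with two well-separated classes (illustrative only).
xs = [(0.01 * i,) for i in range(20)] + [(10.0 + 0.01 * i,) for i in range(20)]
ys = [0] * 20 + [1] * 20
error = repeated_cv_error(xs, ys)
```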
Figure 5
Log ratio between the error rates of the best compared method and POS. Log ratios measure the improvement or deterioration achieved by the proposed method over the best compared method for three different classifiers: RF, kNN and SVM. The last panel shows the averages of log ratios across all datasets for each classifier.
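
The log ratio described in this caption can be sketched as follows. The caption does not give the base of the logarithm or the direction of the ratio, so the sketch assumes base 2 and divides the best competing method's error rate by that of POS, which makes positive values indicate an improvement by POS. The function names and the numbers in the usage lines are hypothetical and are not results from the paper.

```python
import math


def log_ratio(best_other_error, pos_error):
    """log2(best competing error / POS error); positive values favour POS."""
    if best_other_error <= 0 or pos_error <= 0:
        raise ValueError("error rates must be positive to take a log ratio")
    return math.log2(best_other_error / pos_error)


def mean_log_ratio(error_pairs):
    """Average log ratio over (best_other_error, pos_error) pairs, one per dataset."""
    ratios = [log_ratio(other, pos) for other, pos in error_pairs]
    return sum(ratios) / len(ratios)


# Made-up error rates for two imaginary datasets.
improvement = log_ratio(0.2, 0.1)                      # POS halves the error
average = mean_log_ratio([(0.2, 0.1), (0.1, 0.2)])     # gains and losses cancel
```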
Figure 6
Stability scores for ‘GSE27854’ dataset. Stability scores for feature sets of different sizes selected by the Wil‐RS, MP and POS methods on the ‘GSE27854’ dataset.
Figure 7
Stability scores for ‘GSE24514’ dataset. Stability scores for feature sets of different sizes selected by the Wil‐RS, mRMR, MP and POS methods on the ‘GSE24514’ dataset.
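
The captions do not say how the stability score is defined. One common way to quantify the stability of a feature selection method, used here purely as an assumed stand‐in and not as the paper's measure, is the mean pairwise Jaccard similarity between the gene subsets selected on different resamples of the data. The function names and gene labels below are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two feature subsets (1.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def stability_score(selections):
    """Mean pairwise Jaccard similarity across subsets from resampled data."""
    pairs = list(combinations(selections, 2))
    if not pairs:
        raise ValueError("need at least two selected subsets")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Subsets of size 2 chosen on two imaginary resamples.
score = stability_score([{"g1", "g2"}, {"g2", "g3"}])
```

Evaluating such a score separately for each subset size gives a curve comparable in spirit to the ones the captions describe, with values near 1 meaning the same genes are selected regardless of the resample.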
Figure 8
Stability‐accuracy plot for ‘Lung’ dataset. The stability of the feature selection methods against the corresponding estimated error rates on the ‘Lung’ dataset. The error rates were measured by 50 repetitions of 10‐fold cross validation for three different classifiers: Random Forest (RF), k Nearest Neighbor (kNN) and Support Vector Machine (SVM).
Figure 9
Stability‐accuracy plot for ‘GSE27854’ dataset. The stability of the feature selection methods against the corresponding estimated error rates on the ‘GSE27854’ dataset. The error rates were measured by 50 repetitions of 10‐fold cross validation for three different classifiers: Random Forest (RF), k Nearest Neighbor (kNN) and Support Vector Machine (SVM).
