BMC Bioinformatics. 2012 Nov 13;13:298.
doi: 10.1186/1471-2105-13-298.

Improving accuracy for cancer classification with a new algorithm for genes selection

Hongyan Zhang et al. BMC Bioinformatics. 2012.

Abstract

Background: Although the classification of cancer tissue samples based on gene expression data has advanced considerably in recent years, improving its accuracy remains a major challenge. One difficulty is establishing an effective method that selects a parsimonious set of relevant genes. So far, most gene selection methods in the literature screen individual genes or pairs of genes without considering possible interactions among genes. Here we introduce a new computational method named the Binary Matrix Shuffling Filter (BMSF). It not only overcomes the difficulties associated with the search schemes of traditional wrapper methods and the overfitting problem in a high-dimensional search space, but also takes potential gene interactions into account during gene selection. Coupled with a Support Vector Machine (SVM) for implementation, the method often selects a very small number of genes, which makes the model easier to interpret.

Results: We applied our method to 9 two-class gene expression datasets involving human cancers. During gene selection, the set of genes kept in the model was recursively refined and repeatedly updated according to the effect of a given gene on the contributions of other genes to cancer classification. The small number of informative genes selected from each dataset led to significantly improved leave-one-out cross-validation (LOOCV) classification accuracy across all 9 datasets for multiple classifiers. The selected genes also generalize broadly: multiple commonly used classifiers achieved LOOCV accuracy equivalent to or much higher than that reported in the literature.

Conclusions: A gene's contribution to binary cancer classification is better evaluated after adjusting for the joint effect of a large number of other genes. We provide a computationally efficient search scheme for the extensive feature space that includes possible interactions among many genes. The algorithm's performance on 9 datasets suggests that the accuracy of cancer classification can be improved by a large margin when the joint effects of many genes are considered.
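The leave-one-out evaluation mentioned in the Results can be illustrated with a minimal sketch. It is not the authors' pipeline: the nearest-centroid classifier stands in for the SVM, LDA, and NB classifiers named in the abstract, and the function names and toy data are hypothetical.

```python
from typing import Callable, List, Sequence


def nearest_centroid_predict(train_x: List[Sequence[float]],
                             train_y: List[int],
                             x: Sequence[float]) -> int:
    """Stand-in classifier (an assumption, not the paper's SVM):
    assign x to the class whose mean expression vector is closest."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, y in zip(train_x, train_y) if y == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loocv_accuracy(xs: List[Sequence[float]],
                   ys: List[int],
                   predict: Callable = nearest_centroid_predict) -> float:
    """Hold out each sample once, train on the remaining samples,
    and return the fraction of held-out samples predicted correctly."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if predict(train_x, train_y, xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)


# Hypothetical toy data: 6 samples, 2 "genes", two well-separated classes.
toy_x = [[0.1, 1.0], [0.2, 0.9], [0.0, 1.1],
         [1.0, 0.1], [0.9, 0.2], [1.1, 0.0]]
toy_y = [0, 0, 0, 1, 1, 1]
print(loocv_accuracy(toy_x, toy_y))
```

Any classifier with the same `predict(train_x, train_y, x)` signature can be passed in, which mirrors how one gene subset can be scored under several classifiers.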


Figures

Figure 1
Plot of LOOCV accuracy of LDA, NB, and SVM using the k top-ranked genes from SVM-RFE for k = 2, …, 150. The accuracy of SVM generally increases as more genes are included in the model. The accuracies of LDA and NB show no increasing pattern, suggesting that the gene ranking by SVM-RFE is SVM-specific and may not generalize well to NB or LDA. The plotted curves assume that the number of genes is known (the oracle situation). If the number of genes to use is not known, additional variability is introduced into the LOOCV accuracy. The diamond-shaped points show the LOOCV accuracy of the LDA, NB, and SVM classifiers using the genes selected by BMSF.
Figure 2
Comparison with the top performance results reported in the literature for nine cancer datasets.
Figure 3
Average of absolute correlation (AAC) at each stage and its relationship with NB and SVM. The top panel gives the AAC for each dataset. ‘Original’ refers to the entire dataset; ‘filtering’ refers to the stage after the procedures in Section 5.1; ‘Detailed evaluation’ refers to the stage at the end of Section 5.2. The bottom panels show the relationship between the AAC of the original dataset and the NB, BMSF-NB, SVM, and BMSF-SVM classifiers. The original AAC appears to be inversely related to the accuracy of NB. Its relationship with SVM is not obvious. BMSF-NB and BMSF-SVM are much less influenced by the original AAC.
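As a rough illustration of the AAC statistic, the sketch below averages the absolute Pearson correlation over all pairs of genes. The caption does not give the exact definition, so the pairwise-Pearson reading, the function names, and the toy vectors are assumptions.

```python
from itertools import combinations
from math import sqrt
from typing import List, Sequence


def pearson(u: Sequence[float], v: Sequence[float]) -> float:
    """Pearson correlation of two equal-length, non-constant vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    sd_u = sqrt(sum((a - mean_u) ** 2 for a in u))
    sd_v = sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (sd_u * sd_v)


def average_absolute_correlation(genes: List[Sequence[float]]) -> float:
    """Mean of |r| over all gene pairs (assumed reading of AAC).
    `genes` holds one expression vector per gene, all of equal length."""
    pairs = list(combinations(genes, 2))
    return sum(abs(pearson(u, v)) for u, v in pairs) / len(pairs)


# Hypothetical toy input: three genes measured on four samples.
toy_genes = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]]
print(average_absolute_correlation(toy_genes))
```

Under this reading, a high AAC means many genes carry redundant information, which is consistent with the caption's remark that NB accuracy falls as the original AAC rises.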
Figure 4
The change in the number of selected genes in each round. The values labeled are the best MCC.
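The caption does not expand MCC. Assuming it denotes the Matthews correlation coefficient, a common summary score for two-class problems, it can be computed from the confusion-matrix counts as in this sketch (the function name and the zero-denominator convention are choices made here, not taken from the paper).

```python
from math import sqrt


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion counts.
    Returns 0.0 when a marginal total is zero (a common convention,
    assumed here) to avoid division by zero."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Perfect, inverted, and chance-level predictions on 10 samples.
print(matthews_corrcoef(5, 5, 0, 0))
print(matthews_corrcoef(0, 0, 5, 5))
print(matthews_corrcoef(3, 3, 2, 2))
```

Unlike raw accuracy, this score stays near 0 for a classifier that always predicts the majority class, which matters when the two cancer classes are unbalanced.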
Figure 5
Joint effect of informative genes from multiple runs on the leukemia dataset. The left panel gives the LOOCV accuracy +/− standard error from 30 runs using the combined list of genes. The right panel gives the number of genes in the combined list +/− standard deviation from 30 runs. In both plots, the horizontal axis gives the number of lists being combined.
Figure 6
Comparison of different variable selection methods for the same classification algorithm. For each classification algorithm (LDA, QDA, SVM, NB), the same number of genes is selected for each cancer dataset by BMSF and by 11 other variable selection criteria (the number of genes is the one chosen by BMSF). The LOOCV accuracy is presented in a dot plot in which the horizontal coordinate of a point indicates the accuracy, so a point further to the right represents higher accuracy. In most cases, the algorithms using variables selected by BMSF reach the highest LOOCV accuracy. For the GCM data, the variables selected by the eight criteria from RankGene and MaxRel cannot be used for QDA because of rank deficiency, so the average accuracy for QDA is calculated over the other datasets for a fair comparison.
