BMC Bioinformatics. 2005 Sep 28:6:239.
doi: 10.1186/1471-2105-6-239.

Optimized between-group classification: a new jackknife-based gene selection procedure for genome-wide expression data

Florent Baty et al. BMC Bioinformatics. 2005.

Abstract

Background: A recent publication described a supervised classification method for microarray data: Between Group Analysis (BGA). This method, which is based on multivariate ordination of groups, proved very efficient both for classifying samples into pre-defined groups and for predicting the disease class of new, unknown samples. Classification and prediction with BGA are classically performed using the whole set of genes, and no variable selection is required. We hypothesize that an optimized selection of highly discriminating genes might improve the predictive power of BGA.

Results: We propose an optimized between-group classification (OBC) which uses a jackknife-based gene selection procedure. OBC emphasizes classification accuracy rather than feature selection. It is a backward optimization procedure that maximizes the percentage of between-group inertia by removing the least influential genes one by one from the analysis. This selects a subset of highly discriminative genes that optimizes disease class prediction. We applied OBC to four datasets and compared it to other classification methods.
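The backward procedure described above can be illustrated with a minimal sketch. This is not the authors' R implementation: it assumes that the percentage of between-group inertia is the between-group sum of squares divided by the total sum of squares on centered data, and it reads "least influential gene" as the gene whose omission leaves the highest ratio. The function names, the toy matrix `X`, and the labels are all hypothetical. The default stopping size of 5 genes is taken from the Figure 1 caption.

```python
def between_group_ratio(X, labels, genes):
    """Between-group inertia / total inertia, using only the listed gene columns."""
    n = len(X)
    total = 0.0
    between = 0.0
    for j in genes:
        col = [row[j] for row in X]
        mean = sum(col) / n
        total += sum((v - mean) ** 2 for v in col)
        groups = {}
        for v, g in zip(col, labels):
            groups.setdefault(g, []).append(v)
        # Each group contributes its size times the squared centroid offset.
        between += sum(len(vs) * (sum(vs) / len(vs) - mean) ** 2
                       for vs in groups.values())
    return between / total if total > 0 else 0.0


def jackknife_drop(X, labels, genes):
    """Leave each gene out in turn; return the gene whose removal
    gives the highest ratio, together with that ratio."""
    best_gene, best_ratio = None, -1.0
    for g in genes:
        rest = [k for k in genes if k != g]
        ratio = between_group_ratio(X, labels, rest)
        if ratio > best_ratio:
            best_gene, best_ratio = g, ratio
    return best_gene, best_ratio


def backward_select(X, labels, genes, min_genes=5):
    """Remove one gene per iteration until min_genes remain.
    Returns the list of (gene subset, ratio) pairs visited."""
    genes = list(genes)
    path = [(list(genes), between_group_ratio(X, labels, genes))]
    while len(genes) > min_genes:
        g, ratio = jackknife_drop(X, labels, genes)
        genes.remove(g)
        path.append((list(genes), ratio))
    return path


# Toy data (hypothetical): rows are samples, columns are genes.
# Gene 0 separates the two groups, gene 2 is pure within-group noise.
X = [[1.0, 0.2, 5.0],
     [1.2, 0.1, -5.0],
     [3.0, 0.3, 4.0],
     [3.2, 0.2, -4.0]]
labels = ["H", "H", "S", "S"]

for subset, ratio in backward_select(X, labels, [0, 1, 2], min_genes=1):
    print(subset, round(ratio, 3))
```

The sketch only tracks the inertia ratio along the elimination path. According to the Figure 1 caption, the published procedure picks the final subset using classification accuracy and stability as well, which is not reproduced here.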

Conclusion: OBC considerably improved the classification and predictive accuracy of BGA when assessed using independent datasets and leave-one-out cross-validation.
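Leave-one-out cross-validation (LOOCV), one of the two assessments named above, can be sketched as follows. Each sample is held out once, a classifier is fitted on the remaining samples, and the held-out sample is predicted. As an assumption, a nearest-centroid rule in gene space stands in for the BGA classifier. The function names and the toy data are hypothetical.

```python
def nearest_centroid(train_X, train_y, x):
    """Assign x to the group whose centroid is closest (squared Euclidean distance)."""
    sums, counts = {}, {}
    for row, g in zip(train_X, train_y):
        acc = sums.setdefault(g, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
        counts[g] = counts.get(g, 0) + 1

    def dist2(g):
        return sum((x[j] - sums[g][j] / counts[g]) ** 2 for j in range(len(x)))

    # sorted() makes tie-breaking deterministic.
    return min(sorted(sums), key=dist2)


def loocv_accuracy(X, y, classify=nearest_centroid):
    """Fraction of samples predicted correctly when each is held out in turn."""
    hits = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if classify(train_X, train_y, X[i]) == y[i]:
            hits += 1
    return hits / len(X)


# Toy data (hypothetical): six samples, two genes, two groups.
X = [[1.0, 0.2], [1.2, 0.1], [0.9, 0.3],
     [3.0, 0.3], [3.2, 0.2], [2.9, 0.1]]
y = ["H", "H", "H", "S", "S", "S"]

print(loocv_accuracy(X, y))
```

Any function with the same `(train_X, train_y, x)` signature can be passed as `classify`, so the stand-in rule can be swapped for another classifier without touching the loop.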

Availability: The R code is freely available [see Additional file 1] as well as supplementary information [see Additional file 2].

Figures

Figure 1
Overall description of OBC. Three steps are required to perform OBC optimization. In the first, pre-selection step, the n most discriminating genes are selected by performing a BGA on the training set with the whole set of genes. In the second step, a jackknife optimization is performed on this initial subset, and the genes that are least influential in terms of % BG inertia are removed successively. This step is repeated, reducing the number of genes down to 5. Finally, in the third step, the optimal subset of genes is identified (the subset with the best classification accuracy and the best stability).
Figure 2
Optimized between-group classification applied to sarcoidosis data. In panel A, 24 individuals (solid circles) in the training set (H: healthy controls, SI: sarcoidosis stage I, SII: sarcoidosis stage II/III) and 8 individuals (empty circles) in the test set (283, 286, 287, 289 and 290 as H; 282, 284 and 285 as SII) are classified by a standard BGA using the whole set of genes. Panel B shows the different parameters of OBC as a function of the number of genes used in the analysis: the percentage of between-group inertia (solid line), the percentage of good cross-validation (dashed line) and the variance of between-group inertia (dot-dashed line). For reference, the percentage of test samples correctly predicted is shown as a dotted line; this parameter was not used in the optimization of the training model. The vertical line shows the optimal number of genes. In panel C, the 105 most discriminating genes (initial subset) are located at the periphery of the biplot (black crosses) and the 58 optimal genes are highlighted (circled crosses). In panel D, the 8 test samples are classified using a BGA based on the 58 optimal genes.
Figure 3
Optimized between-group classification applied to tumour data. In panel A, 63 samples (solid circles) of the training set (BL: Burkitt's lymphoma, EWS: Ewing's sarcoma, NB: neuroblastoma, RMS: rhabdomyosarcoma) and 25 samples (empty circles) of the test set (7, 15 and 18 as BL-NHL; 2, 6, 12, 19, 20 and 21 as EWS; 1, 8, 14, 16, 23 and 25 as NB; 4, 10, 17, 22 and 24 as RMS; 3, 5, 9, 11 and 13 as control samples that do not belong to one of the 4 groups) are classified by the standard BGA based on the whole set of genes. Panel B shows the different parameters of OBC as a function of the number of genes used in the analysis: the percentage of between-group inertia (solid line), the percentage of good cross-validation (dashed line) and the variance of between-group inertia (dot-dashed line). For reference, the percentage of test samples correctly predicted is shown as a dotted line; this parameter was not used in the optimization of the training model. The vertical line shows the optimal number of genes. In panel C, the 245 most discriminating genes are represented with small crosses and the 90 optimal genes are highlighted (circled crosses). In panel D, the 25 test samples are classified using a BGA based on the 90 optimal genes.
Figure 4
Analysis of sensitivity and specificity. The sensitivity and specificity of OBC (solid circles) were compared to standard BGA (empty circles). The prediction accuracy of OBC when applied to the sarcoidosis dataset was assessed using (A) LOOCV (left panel) and classification of the independent dataset (right panel). OBC was also applied to the tumour dataset and tested using (B) LOOCV (left panel) and classification of the independent dataset (right panel). Arrows show the improvement of sensitivity and specificity obtained with OBC compared to the standard BGA.
Figure 5
Number of genes included in the initial subset. This plot shows the maximum % BG inertia reached by the optimization procedure as a function of the number of genes present in the initial subset of genes (top curve: tumour data; bottom curve: sarcoidosis data). The dashed lines delimit the optimal size of the initial subset of genes for both datasets (above which the gain in % BG inertia is lower).
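
The sensitivity and specificity compared in Figure 4 can be computed per class from true and predicted labels, treating one class as positive and all others as negative. The sketch below is only an illustration of that definition. The function name is hypothetical, and the labels and predictions are made up. They are not results from the paper.

```python
def sensitivity_specificity(y_true, y_pred, positive):
    """One-vs-rest sensitivity and specificity for the class `positive`."""
    tp = fn = tn = fp = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    # NaN marks the undefined case (no positive or no negative samples).
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


# Made-up labels and predictions, for illustration only.
y_true = ["control"] * 5 + ["case"] * 3
y_pred = ["control", "control", "control", "control", "case",
          "case", "case", "control"]

sens, spec = sensitivity_specificity(y_true, y_pred, positive="case")
print(sens, spec)
```

For a multi-class problem such as the four tumour groups, the same function can be called once per class with a different `positive` label.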
