BMC Bioinformatics. 2005 Sep 28;6:239.

doi: 10.1186/1471-2105-6-239.

Optimized between-group classification: a new jackknife-based gene selection procedure for genome-wide expression data

Florent Baty¹, Michel P Bihl, Guy Perrière, Aedín C Culhane, Martin H Brutsche

PMID: 16191195
PMCID: PMC1261161
DOI: 10.1186/1471-2105-6-239

Florent Baty et al. BMC Bioinformatics. 2005.

Affiliation

¹ Pulmonary Gene Research, University Hospital Basel, CH-4031 Basel, Switzerland. florent.baty@unibas.ch


Abstract

Background: A recent publication described a supervised classification method for microarray data: Between Group Analysis (BGA). This method, which is based on multivariate ordination of groups, proved very efficient both for classifying samples into pre-defined groups and for predicting the disease class of new, unknown samples. Classification and prediction with BGA are classically performed using the whole set of genes, and no variable selection is required. We hypothesize that an optimized selection of highly discriminating genes might improve the predictive power of BGA.

Results: We propose an optimized between-group classification (OBC), which uses a jackknife-based gene selection procedure. OBC emphasizes classification accuracy rather than feature selection. It is a backward optimization procedure that maximizes the percentage of between-group inertia by removing the least influential genes one by one from the analysis. This selects a subset of highly discriminative genes that optimizes disease class prediction. We applied OBC to four datasets and compared it with other classification methods.
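The quantity that OBC maximizes, the percentage of between-group inertia, can be illustrated with a minimal Python sketch. It assumes a plain centered (PCA-like) setting in which inertia is the sum of squared deviations from the grand mean; the abstract does not state which ordination underlies the BGA, and the function name `percent_bg_inertia` and the toy data are hypothetical. This is not the authors' R code.

```python
from collections import defaultdict


def percent_bg_inertia(X, groups, genes=None):
    """Percentage of total inertia explained by group means.

    X      : list of samples, each a list of expression values (one per gene)
    groups : class label of each sample
    genes  : column indices to include (default: all genes)

    Assumption: unweighted samples and squared Euclidean distances.
    """
    genes = list(range(len(X[0]))) if genes is None else list(genes)
    n = len(X)

    # Grand mean of every selected gene.
    grand = {g: sum(row[g] for row in X) / n for g in genes}

    # Total inertia: squared deviations of all samples from the grand mean.
    total = sum((row[g] - grand[g]) ** 2 for row in X for g in genes)

    # Collect the samples belonging to each group.
    members = defaultdict(list)
    for row, label in zip(X, groups):
        members[label].append(row)

    # Between-group inertia: size-weighted squared deviations of group means.
    between = 0.0
    for rows in members.values():
        for g in genes:
            group_mean = sum(r[g] for r in rows) / len(rows)
            between += len(rows) * (group_mean - grand[g]) ** 2

    return 100.0 * between / total if total > 0 else 0.0


# Toy data (made up): gene 0 separates the groups, gene 1 does not.
X = [[1, 5], [1, 7], [3, 5], [3, 7]]
groups = ["A", "A", "B", "B"]
print(percent_bg_inertia(X, groups))             # both genes
print(percent_bg_inertia(X, groups, genes=[0]))  # discriminating gene only
```

In this toy case, dropping the uninformative gene raises the percentage, which is the effect the backward procedure exploits.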

Conclusion: OBC considerably improved the classification and predictive accuracy of BGA, when assessed using independent data sets and leave-one-out cross-validation.

Availability: The R code is freely available [see Additional file 1] as well as supplementary information [see Additional file 2].


Figures

**Figure 1**
**Overall description of OBC**. OBC optimization takes three steps. In the first, pre-selection step, the n most discriminating genes are selected by performing a BGA on the training set with the whole set of genes. In the second step, a jackknife optimization is performed on this initial subset, and the genes that are least influential in terms of % BG inertia are removed successively. This step is iterated, reducing the number of genes down to 5. Finally, in the third step, the optimal subset of genes is identified (the subset with the best classification accuracy and the best stability).
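The first two steps described in this caption can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: pre-selection is replaced by a stand-in that ranks genes by their individual % BG inertia (the paper uses a BGA on the whole gene set), inertia is computed in a plain centered setting, and the names `pct_bg_inertia` and `backward_jackknife` are hypothetical. Step 3 (choosing the subset by classification accuracy and stability) is left out because it requires cross-validation.

```python
def pct_bg_inertia(X, labels, genes):
    """% of total inertia explained by group means, over the given gene columns."""
    n = len(X)
    between = total = 0.0
    for g in genes:
        col = [row[g] for row in X]
        grand = sum(col) / n
        total += sum((v - grand) ** 2 for v in col)
        for lab in set(labels):
            vals = [v for v, l in zip(col, labels) if l == lab]
            between += len(vals) * (sum(vals) / len(vals) - grand) ** 2
    return 100.0 * between / total if total > 0 else 0.0


def backward_jackknife(X, labels, n_initial, min_genes=5):
    """Return the elimination path as a list of (gene subset, % BG inertia)."""
    # Step 1 (stand-in for the BGA-based pre-selection): keep the n_initial
    # genes with the highest individual % BG inertia.
    ranked = sorted(range(len(X[0])),
                    key=lambda g: pct_bg_inertia(X, labels, [g]),
                    reverse=True)
    subset = ranked[:n_initial]
    path = [(list(subset), pct_bg_inertia(X, labels, subset))]

    # Step 2: leave each gene out in turn; the gene whose removal yields the
    # highest % BG inertia is treated as the least influential and dropped.
    while len(subset) > min_genes:
        scores = {out: pct_bg_inertia(X, labels, [g for g in subset if g != out])
                  for out in subset}
        drop = max(scores, key=scores.get)
        subset = [g for g in subset if g != drop]
        path.append((list(subset), scores[drop]))
    return path


# Toy data (made up): gene 0 is fully discriminating, gene 1 is noise,
# gene 2 is partly discriminating.
X = [[1, 5, 1], [1, 7, 2], [3, 5, 2], [3, 7, 3]]
labels = ["A", "A", "B", "B"]
path = backward_jackknife(X, labels, n_initial=3, min_genes=1)
for genes, score in path:
    print(genes, round(score, 2))
```

Each leave-one-out pass recomputes the criterion for every remaining gene, so the cost grows roughly quadratically with the size of the initial subset, which is one reason a pre-selection step is useful.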

**Figure 2**
**Optimized between-group classification applied to sarcoidosis data**. In panel A, 24 individuals (solid circles) in the training set (H: healthy controls, SI: sarcoidosis stage I, SII: sarcoidosis stage II/III) and 8 individuals (empty circles) in the test set (283, 286, 287, 289 and 290 as H; 282, 284 and 285 as SII) are classified by a standard BGA using the whole set of genes. Panel B shows the different parameters of OBC as a function of the number of genes used in the analysis: the percentage of between group inertia (solid line), the percentage of good cross-validation (dashed line) and the variance of between group inertia (dot-dashed line). For reference, the percentage of test samples correctly predicted is represented by a dotted line; this parameter was not used in the optimization of the training model. The vertical line shows the optimal number of genes. In panel C, the 105 most discriminating genes (initial subset) are located at the periphery of the biplot (black crosses) and the 58 optimal genes are highlighted (circled crosses). In panel D, the 8 test samples are classified using a BGA based on the 58 optimal genes.

**Figure 3**
**Optimized between-group classification applied to tumour data**. In panel A, 63 samples (solid circles) of the training set (BL: Burkitt's lymphoma, EWS: Ewing's sarcoma, NB: neuroblastoma, RMS: rhabdomyosarcoma) and 25 samples (empty circles) of the test set (7, 15 and 18 as BL-NHL; 2, 6, 12, 19, 20 and 21 as EWS; 1, 8, 14, 16, 23 and 25 as NB; 4, 10, 17, 22 and 24 as RMS; 3, 5, 9, 11 and 13 as control samples that do not belong to one of the 4 groups) are classified by the standard BGA based on the whole set of genes. Panel B shows the different parameters of OBC as a function of the number of genes used in the analysis: the percentage of between group inertia (solid line), the percentage of good cross-validation (dashed line) and the variance of between group inertia (dot-dashed line). For reference, the percentage of test samples correctly predicted is represented by a dotted line; this parameter was not used in the optimization of the training model. The vertical line shows the optimal number of genes. In panel C, the 245 most discriminating genes are represented with small crosses and the 90 optimal genes are highlighted (circled crosses). In panel D, the 25 test samples are classified using a BGA based on the 90 optimal genes.

**Figure 4**
**Analysis of sensitivity and specificity**. The sensitivity and specificity of OBC (solid circles) were compared to standard BGA (empty circles). The prediction accuracy of OBC when applied to the sarcoidosis dataset was assessed using (A) LOOCV (left panel) and classification of the independent dataset (right panel). OBC was also applied to the tumour dataset and tested using (B) LOOCV (left panel) and classification of the independent dataset (right panel). Arrows show the improvement of sensitivity and specificity obtained with OBC compared to the standard BGA.
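For readers unfamiliar with the two measures compared in this figure, the sketch below computes sensitivity and specificity for one class against all others. The caption does not say how the measures were aggregated across classes, so the one-vs-rest treatment is an assumption; the function name and the labels are hypothetical and are not taken from the paper's data.

```python
def sensitivity_specificity(true_labels, predicted_labels, positive):
    """One-vs-rest sensitivity and specificity for the class `positive`."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == positive:
            if pred == positive:
                tp += 1  # positive sample correctly predicted
            else:
                fn += 1  # positive sample missed
        else:
            if pred == positive:
                fp += 1  # other class wrongly called positive
            else:
                tn += 1  # other class correctly rejected
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Made-up labels: one healthy sample is misclassified as diseased.
truth = ["H", "H", "H", "S", "S"]
pred = ["H", "H", "S", "S", "S"]
sens, spec = sensitivity_specificity(truth, pred, positive="S")
print(sens, spec)
```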

**Figure 5**
**Number of genes included in the initial subset**. This plot shows the maximum % BG inertia reached by the optimization procedure as a function of the number of genes present in the initial subset of genes (top curve: tumour data; bottom curve: sarcoidosis data). The dashed lines delimit the optimal size of the initial subset of genes for both datasets (above which the gain in % BG inertia is lower).


Cited by

Stability of gene contributions and identification of outliers in multivariate analysis of microarray data.
Baty F, Jaeger D, Preiswerk F, Schumacher MM, Brutsche MH. BMC Bioinformatics. 2008 Jun 20;9:289. doi: 10.1186/1471-2105-9-289. PMID: 18570644.
Expression profiling in granulomatous lung disease.
Chen ES, Moller DR. Proc Am Thorac Soc. 2007 Jan;4(1):101-7. doi: 10.1513/pats.200607-140JG. PMID: 17202298. Review.

