Stratification bias in low signal microarray studies

Brian J Parker et al. BMC Bioinformatics. 2007 Sep 2;8:326. doi: 10.1186/1471-2105-8-326.
Abstract

Background: When analysing microarray and other small sample size biological datasets, care is needed to avoid various biases. We analyse a form of bias, stratification bias, that can substantially affect analyses using sample-reuse validation techniques and lead to inaccurate results. This bias is due to imperfect stratification of samples in the training and test sets and the dependency between these stratification errors: because the training and test sets partition a fixed dataset, the variations in their class proportions are negatively correlated.
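
The negative correlation is mechanical: in cross-validation the training and test sets partition a fixed dataset, so any excess of one class in a test fold is mirrored exactly by a deficit in its paired training set. A minimal sketch of this, assuming scikit-learn's KFold and the balanced 30-sample setting used in the simulations below:

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n = 30                                           # sample size, as in the paper's simulations
y = rng.permutation(np.repeat([0, 1], n // 2))   # balanced binary labels

train_prop, test_prop = [], []
for tr, te in KFold(n_splits=10, shuffle=True, random_state=0).split(y):
    train_prop.append(y[tr].mean())              # class-1 proportion in the training set
    test_prop.append(y[te].mean())               # class-1 proportion in the paired test set

# Train and test proportions are affinely related with negative slope
# (the folds partition a fixed label vector), so the correlation is -1.
print(np.corrcoef(train_prop, test_prop)[0, 1])
```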

Results: We show that when estimating the performance of classifiers on low signal datasets (i.e. those which are difficult to classify), which are typical of many prognostic microarray studies, commonly used performance measures can suffer from a substantial negative bias. For error rate this bias is severe only in quite restricted situations, but it can be much larger and more frequent when using ranking measures such as the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC). Substantial biases are demonstrated in simulations and on the van 't Veer breast cancer dataset. The classification error rate can have large negative biases for balanced datasets, whereas the AUC shows substantial pessimistic biases even for imbalanced datasets. In simulation studies using 10-fold cross-validation, AUC values of less than 0.3 can be observed on random datasets rather than the expected 0.5, and further experiments on the van 't Veer dataset show that these biases arise in practice.
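
A hedged re-creation of the headline simulation (not the authors' code: scikit-learn's LinearDiscriminantAnalysis stands in for DLDA, with which it coincides in the univariate case) shows the effect. On pure-noise data, pooling test scores across the folds of 10-fold unstratified CV drives the AUC well below the 0.5 a random classifier should score, while averaging per-fold AUCs does not:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
pooled, averaged = [], []

for run in range(500):                            # 500 runs, as in Figure 1
    X = rng.normal(size=(30, 1))                  # d' = 0: both classes are N(0, 1)
    y = rng.permutation(np.repeat([0, 1], 15))

    scores, fold_aucs = np.empty(30), []
    for tr, te in KFold(10, shuffle=True, random_state=run).split(X):
        clf = LinearDiscriminantAnalysis().fit(X[tr], y[tr])
        scores[te] = clf.decision_function(X[te])
        if len(np.unique(y[te])) == 2:            # per-fold AUC needs both classes
            fold_aucs.append(roc_auc_score(y[te], scores[te]))

    pooled.append(roc_auc_score(y, scores))
    averaged.append(np.mean(fold_aucs))

# Each fold's scores shift in a direction that tracks its training-set class
# imbalance, so pooling anti-correlates score and label across folds, while
# averaging per-fold AUCs never compares scores from different folds.
print(f"pooled AUC:   {np.mean(pooled):.2f}")     # well below 0.5
print(f"averaged AUC: {np.mean(averaged):.2f}")   # close to 0.5
```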

Conclusion: Stratification bias can substantially affect several performance measures. In computing the AUC, the strategy of pooling the test samples from the various folds of cross-validation can lead to large biases; computing it as the average of per-fold estimates avoids this bias and is thus the recommended approach. As a more general solution applicable to other performance measures, we show that stratified repeated holdout and modified versions of k-fold cross-validation, namely balanced, stratified cross-validation and balanced leave-one-out cross-validation, avoid the bias. Therefore, for model selection and evaluation of microarray and other small biological datasets, these methods should be used and unstratified versions avoided. In particular, the commonly used (unbalanced) leave-one-out cross-validation should not be used to estimate the AUC for small datasets.
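
In scikit-learn terms, the recommendation amounts to stratifying the folds and reporting the mean of per-fold AUCs rather than an AUC over pooled scores. A minimal sketch, with the caveat that StratifiedKFold only approximates the paper's balanced, stratified construction (it matches fold class proportions to the full dataset but does not enforce the additional balancing step):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def fold_averaged_auc(X, y, n_splits=10, seed=0):
    """AUC as the mean of per-fold estimates under stratified CV."""
    aucs = []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for tr, te in skf.split(X, y):    # stratification keeps class proportions even
        clf = LinearDiscriminantAnalysis().fit(X[tr], y[tr])
        aucs.append(roc_auc_score(y[te], clf.decision_function(X[te])))
    return float(np.mean(aucs))

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 1))                      # pure noise: true AUC is 0.5
y = rng.permutation(np.repeat([0, 1], 15))
print(fold_averaged_auc(X, y))                    # unbiased, though single runs vary
```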

Figures

Figure 1
Simulation results averaged over 500 runs using DLDA and versions of CV for a random signal. The two classes have the same univariate Gaussian distribution (d' = 0), and the known mean and variance are used by the classifier. The number of samples is 30. The blue circles and green diamonds show the AUC computed using the pooling and averaging strategies, respectively. The accuracy and balanced accuracy (1 - BER) are shown as red triangles and black crosses.

Figure 2
Simulation results using DLDA and 10-fold unstratified CV for a weak signal. Same experimental setup as in Figure 1(a), but with d' = 0.5.

Figure 3
Simulation results using weighted SVM. The dataset was balanced through weighting by the inverse of the overall class proportions. Uses CV and SVM on a dataset of size 30 with discriminability d' = 0.5. The blue circles and red triangles show the AUC calculated using the pooling strategy and the classification accuracy, respectively.

Figure 4
Correlation and covariance of class proportions between the training and test sets. The class proportion of each whole dataset is 0.5.

Figure 5
Simulated classification results using DLDA (with known mean and variance) and varying separation of the means. The two classes have a univariate Gaussian distribution.

Figure 6
Simulated classification results using DLDA for varying class proportions. The two classes have a univariate Gaussian distribution and d' = 0.2. The mean and variance of the Gaussians were estimated from the data.

Figure 7
Simulated classification results for various induction algorithms. The two classes have a multivariate Gaussian distribution (10 dimensions). The discriminability d' is 1 and the dataset contains 50 elements.

Figure 8
Same experimental setup as in Figure 7, but the AUC is calculated by averaging over the folds.

Figure 9
AUC estimates for the van 't Veer breast cancer dataset using a linear SVM.

Figure 10
Standard deviations of AUC estimates for the van 't Veer dataset using a linear SVM.

Figure 11
AUC and ROC curve estimates for the randomised van 't Veer dataset. Error bars for ROC curves are 1 SE.

Figure 12
Error rate estimates for the randomised van 't Veer dataset.
