Stratification bias in low signal microarray studies

Brian J Parker et al. BMC Bioinformatics. 2007 Sep 2;8:326. doi: 10.1186/1471-2105-8-326.
Abstract

Background: When analysing microarray and other small sample size biological datasets, care is needed to avoid various biases. We analyse a form of bias, stratification bias, that can substantially affect analyses using sample-reuse validation techniques and lead to inaccurate results. This bias is due to imperfect stratification of samples in the training and test sets and the dependency between these stratification errors: because the training and test sets partition a fixed dataset, the variations in their class proportions are negatively correlated.
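
The negative correlation is mechanical: in cross-validation the training and test sets partition a fixed dataset, so any excess of one class in a test fold is mirrored exactly by a deficit in its paired training set. A minimal sketch of this, assuming scikit-learn's KFold and the balanced 30-sample setting used in the simulations below:

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n = 30                                           # sample size, as in the paper's simulations
y = rng.permutation(np.repeat([0, 1], n // 2))   # balanced binary labels

train_prop, test_prop = [], []
for tr, te in KFold(n_splits=10, shuffle=True, random_state=0).split(y):
    train_prop.append(y[tr].mean())              # class-1 proportion in the training set
    test_prop.append(y[te].mean())               # class-1 proportion in the paired test set

# Train and test proportions are affinely related with negative slope
# (the folds partition a fixed label vector), so the correlation is -1.
print(np.corrcoef(train_prop, test_prop)[0, 1])
```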

Results: We show that when estimating the performance of classifiers on low signal datasets (i.e. those which are difficult to classify), which are typical of many prognostic microarray studies, commonly used performance measures can suffer from a substantial negative bias. For error rate this bias is severe only in quite restricted situations, but it can be much larger and more frequent when using ranking measures such as the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC). Substantial biases are demonstrated in simulations and on the van 't Veer breast cancer dataset. The classification error rate can have large negative biases for balanced datasets, whereas the AUC shows substantial pessimistic biases even for imbalanced datasets. In simulation studies using 10-fold cross-validation, AUC values of less than 0.3 can be observed on random datasets rather than the expected 0.5, and further experiments on the van 't Veer dataset show that these biases arise in practice.
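
A hedged re-creation of the headline simulation (not the authors' code: scikit-learn's LinearDiscriminantAnalysis stands in for DLDA, with which it coincides in the univariate case) shows the effect. On pure-noise data, pooling test scores across the folds of 10-fold unstratified CV drives the AUC well below the 0.5 a random classifier should score, while averaging per-fold AUCs does not:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
pooled, averaged = [], []

for run in range(500):                            # 500 runs, as in Figure 1
    X = rng.normal(size=(30, 1))                  # d' = 0: both classes are N(0, 1)
    y = rng.permutation(np.repeat([0, 1], 15))

    scores, fold_aucs = np.empty(30), []
    for tr, te in KFold(10, shuffle=True, random_state=run).split(X):
        clf = LinearDiscriminantAnalysis().fit(X[tr], y[tr])
        scores[te] = clf.decision_function(X[te])
        if len(np.unique(y[te])) == 2:            # per-fold AUC needs both classes
            fold_aucs.append(roc_auc_score(y[te], scores[te]))

    pooled.append(roc_auc_score(y, scores))
    averaged.append(np.mean(fold_aucs))

# Each fold's scores shift in a direction that tracks its training-set class
# imbalance, so pooling anti-correlates score and label across folds, while
# averaging per-fold AUCs never compares scores from different folds.
print(f"pooled AUC:   {np.mean(pooled):.2f}")     # well below 0.5
print(f"averaged AUC: {np.mean(averaged):.2f}")   # close to 0.5
```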

Conclusion: Stratification bias can substantially affect several performance measures. In computing the AUC, the strategy of pooling the test samples from the various folds of cross-validation can lead to large biases; computing it as the average of per-fold estimates avoids this bias and is thus the recommended approach. As a more general solution applicable to other performance measures, we show that stratified repeated holdout and modified versions of k-fold cross-validation, namely balanced, stratified cross-validation and balanced leave-one-out cross-validation, avoid the bias. Therefore, for model selection and evaluation of microarray and other small biological datasets, these methods should be used and unstratified versions avoided. In particular, the commonly used (unbalanced) leave-one-out cross-validation should not be used to estimate the AUC for small datasets.
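
In scikit-learn terms, the recommendation amounts to stratifying the folds and reporting the mean of per-fold AUCs rather than an AUC over pooled scores. A minimal sketch, with the caveat that StratifiedKFold only approximates the paper's balanced, stratified construction (it matches fold class proportions to the full dataset but does not enforce the additional balancing step):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def fold_averaged_auc(X, y, n_splits=10, seed=0):
    """AUC as the mean of per-fold estimates under stratified CV."""
    aucs = []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for tr, te in skf.split(X, y):    # stratification keeps class proportions even
        clf = LinearDiscriminantAnalysis().fit(X[tr], y[tr])
        aucs.append(roc_auc_score(y[te], clf.decision_function(X[te])))
    return float(np.mean(aucs))

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 1))                      # pure noise: true AUC is 0.5
y = rng.permutation(np.repeat([0, 1], 15))
print(fold_averaged_auc(X, y))                    # unbiased, though single runs vary
```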

Figures

Figure 1
Simulation results averaged over 500 runs using DLDA and versions of CV for a random signal. The two classes have the same univariate Gaussian distribution (d' = 0), and the known mean and variance are used by the classifier. The number of samples is 30. The blue circles and green diamonds show the AUC computed using the pooling and averaging strategies, respectively. The accuracy and balanced accuracy (1 - BER) are shown as red triangles and black crosses.

Figure 2
Simulation results using DLDA and 10-fold unstratified CV for a weak signal. Same experimental setup as in Figure 1(a), but with d' = 0.5.

Figure 3
Simulation results using weighted SVM. The dataset was balanced through weighting by the inverse of the overall class proportions. Uses CV and SVM on a dataset of size 30 with discriminability d' = 0.5. The blue circles and red triangles show the AUC calculated using the pooling strategy and the classification accuracy, respectively.

Figure 4
Correlation and covariance of class proportions between the training and test sets. The class proportion of each whole dataset is 0.5.

Figure 5
Simulated classification results using DLDA (with known mean and variance) and varying separation of the means. The two classes have a univariate Gaussian distribution.

Figure 6
Simulated classification results using DLDA for varying class proportions. The two classes have a univariate Gaussian distribution and d' = 0.2. The mean and variance of the Gaussians were estimated from the data.

Figure 7
Simulated classification results for various induction algorithms. The two classes have a multivariate Gaussian distribution (10 dimensions). The discriminability d' is 1 and the dataset contains 50 elements.

Figure 8
Same experimental setup as in Figure 7, but the AUC is calculated by averaging over the folds.

Figure 9
AUC estimates for the van 't Veer breast cancer dataset using a linear SVM.

Figure 10
Standard deviations of AUC estimates for the van 't Veer dataset using a linear SVM.

Figure 11
AUC and ROC curve estimates for the randomised van 't Veer dataset. Error bars for ROC curves are 1 SE.

Figure 12
Error rate estimates for the randomised van 't Veer dataset.
