Using cross-validation to evaluate predictive accuracy of survival risk classifiers based on high-dimensional data

Richard M Simon¹, Jyothi Subramanian, Ming-Chung Li, Supriya Menezes

PMID: 21324971
PMCID: PMC3105299
DOI: 10.1093/bib/bbr001

Richard M Simon et al. Brief Bioinform. 2011 May;12(3):203-14.

doi: 10.1093/bib/bbr001. Epub 2011 Feb 15.

Affiliation

¹ Biometric Research Branch, US National Cancer Institute, Bethesda, MD 20892-7434, USA. rsimon@nih.gov

Abstract

Developments in whole genome biotechnology have stimulated statistical focus on prediction methods. We review here methodology for classifying patients into survival risk groups and for using cross-validation to evaluate such classifications. Measures of discrimination for survival risk models include separation of survival curves, time-dependent ROC curves and Harrell's concordance index. For high-dimensional data applications, however, computing these measures as re-substitution statistics on the same data used for model development results in highly biased estimates. Most developments in methodology for survival risk modeling with high-dimensional data have utilized separate test data sets for model evaluation. Cross-validation has sometimes been used for optimization of tuning parameters. In many applications, however, the data available are too limited for effective division into training and test sets and consequently authors have often either reported re-substitution statistics or analyzed their data using binary classification methods in order to utilize familiar cross-validation. In this article we have tried to indicate how to utilize cross-validation for the evaluation of survival risk models; specifically how to compute cross-validated estimates of survival distributions for predicted risk groups and how to compute cross-validated time-dependent ROC curves. We have also discussed evaluation of the statistical significance of a survival risk model and evaluation of whether high-dimensional genomic data adds predictive accuracy to a model based on standard covariates alone.
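Of the discrimination measures named in the abstract, Harrell's concordance index is the simplest to state in code. The following is a minimal sketch, not the authors' implementation: the function name `harrell_c_index` is hypothetical, pairs with tied survival times are ignored, and tied risk scores count as half-concordant, which is one common convention.

```python
import numpy as np

def harrell_c_index(time, event, risk):
    """Fraction of usable pairs in which the case with the shorter
    observed survival time also has the higher predicted risk."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    risk = np.asarray(risk, dtype=float)
    concordant = 0.0
    usable = 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue  # the earlier time in a pair must be an observed event
        for j in range(n):
            if time[i] < time[j]:
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # assumed convention for tied scores
    if usable == 0:
        raise ValueError("no usable pairs")
    return concordant / usable

# Four cases; the second is censored at time 8.
print(harrell_c_index([5, 8, 12, 20], [1, 0, 1, 1], [0.9, 0.4, 0.1, 0.6]))  # 0.75
```

Computed on the cases used to fit the model, this is a re-substitution statistic; to obtain a cross-validated version, the `risk` values would have to come from models fitted without the case being scored.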

Figures

**Figure 1:**
Re-substitution Kaplan–Meier survival estimates for cases in the training set classified as high- or low-risk based on survival risk models developed in the same training set. The training set data were simulated from a model in which none of the variables used for modeling were actually prognostic for survival. Each simulation is numbered.

**Figure 2:**
Kaplan–Meier survival estimates for cases in independent test sets classified as high- or low-risk using the same models developed in the corresponding training sets shown in Figure 1. Data for the independent test sets were simulated from the same model used for simulating data for the training sets; none of the variables used for modeling were prognostic for survival.

**Figure 3:**
Cross-validated Kaplan–Meier survival estimates for the training sets shown in Figure 1.
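The contrast between Figures 1 and 3 can be illustrated with a small simulation on null data. This is a sketch under stated assumptions, not the authors' procedure: the classifier (select features by univariate Cox score statistics, sum them with signs, split at the training median), the 10-fold scheme, the sample sizes, and all function names are choices made here for illustration. The essential point is that feature selection and the cutoff are recomputed inside every fold, so each case is classified by a model that never saw it.

```python
import numpy as np

def cox_score_stats(X, time, event):
    """Univariate Cox score statistics at beta = 0, one per column of X."""
    order = np.argsort(time)
    X, event = X[order], event[order]
    # Suffix sums give feature totals over the risk set {j : t_j >= t_i}.
    suffix_sum = np.cumsum(X[::-1], axis=0)[::-1]
    at_risk = np.arange(len(time), 0, -1)[:, None]
    return ((X - suffix_sum / at_risk) * event[:, None]).sum(axis=0)

def fit_classifier(X, time, event, n_features=10):
    """Select features, form a signed-sum score, cut at the training median."""
    u = cox_score_stats(X, time, event)
    top = np.argsort(-np.abs(u))[:n_features]
    signs = np.sign(u[top])
    cutoff = np.median(X[:, top] @ signs)
    return top, signs, cutoff

def predict_high_risk(model, X):
    top, signs, cutoff = model
    return (X[:, top] @ signs) > cutoff

def cross_validated_groups(X, time, event, n_folds=10, seed=0):
    """Classify each case with a model built without that case's fold."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(time)) % n_folds
    high = np.zeros(len(time), dtype=bool)
    for k in range(n_folds):
        test = folds == k
        model = fit_classifier(X[~test], time[~test], event[~test])
        high[test] = predict_high_risk(model, X[test])
    return high

def logrank_chi2(time, event, high):
    """Two-group log-rank statistic, assuming no tied event times."""
    order = np.argsort(time)
    event, high = event[order], high[order]
    at_risk = np.arange(len(time), 0, -1)
    p_high = np.cumsum(high[::-1])[::-1] / at_risk  # share of risk set in high group
    obs_minus_exp = np.sum(event * (high - p_high))
    variance = np.sum(event * p_high * (1 - p_high))
    return obs_minus_exp**2 / variance

# Null data: none of the 2000 simulated features is prognostic.
rng = np.random.default_rng(1)
n, p = 100, 2000
X = rng.standard_normal((n, p))
true_time = rng.exponential(1.0, n)
censor_time = rng.exponential(3.0, n)
time = np.minimum(true_time, censor_time)
event = true_time <= censor_time

resub_groups = predict_high_risk(fit_classifier(X, time, event), X)
cv_groups = cross_validated_groups(X, time, event)
resub_chi2 = logrank_chi2(time, event, resub_groups)
cv_chi2 = logrank_chi2(time, event, cv_groups)
print("re-substitution log-rank chi-square:", round(resub_chi2, 2))
print("cross-validated log-rank chi-square:", round(cv_chi2, 2))
```

Because the risk groups in `cv_groups` are pooled across folds, Kaplan–Meier curves can be drawn for them just as for any two-group comparison. The cross-validated log-rank statistic should not be referred to the usual chi-square distribution, however, since the group labels depend on the survival data of the other cases.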

**Figure 4:**
Kaplan–Meier survival curves for the data from Shedden *et al*. [18]. (A) Re-substitution estimates and (B) cross-validated estimates.

**Figure 5:**
Time-dependent ROC curves for the data from Shedden *et al*. [18]. (A) Re-substitution estimates and (B) cross-validated estimates. The re-substitution area under the curve (AUC) is 0.79 and the cross-validated AUC is 0.53.

**Figure 6:**
Cross-validated Kaplan–Meier curves to compare the prognostic model containing only standard covariates with the model containing both standard covariates and gene expression variables in the data set from Shedden *et al.* [18]. (A) Only standard covariates and (B) standard covariates and gene expression variables.

**Figure 7:**
Cross-validated time-dependent ROC curves to compare the prognostic model containing only standard covariates with the model containing both standard covariates and gene expression variables in the data set from Shedden *et al*. [18].

References

1. Simon R, Radmacher MD, Dobbin K, et al. Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J Natl Cancer Institute. 2003;95:14–18.
2. Dobbin K, Simon R. Sample size planning for developing classifiers using high dimensional DNA microarray data. Biostatistics. 2007;8:101–17.
3. Molinaro AM, Simon R, Pfeiffer RM. Prediction error estimation: a comparison of resampling methods. Bioinformatics. 2005;21:3301–7.
4. Dupuy A, Simon R. Critical review of published microarray studies for cancer outcome and guidelines on statistical analysis and reporting. J Natl Cancer Institute. 2007;99:147–57.
5. Subramanian J, Simon R. Gene expression-based prognostic signatures in lung cancer: ready for clinical use? J Natl Cancer Institute. 2010;102:464–74.
