Assessing the performance of prediction models: a framework for traditional and novel measures

Ewout W Steyerberg et al. Epidemiology. 2010 Jan;21(1):128-38. doi: 10.1097/EDE.0b013e3181c30fb2.

Abstract

The performance of prediction models can be assessed using a variety of methods and metrics. Traditional measures for binary and survival outcomes include the Brier score to indicate overall model performance, the concordance (or c) statistic for discriminative ability (or area under the receiver operating characteristic [ROC] curve), and goodness-of-fit statistics for calibration. Several new measures have recently been proposed that can be seen as refinements of discrimination measures, including variants of the c statistic for survival, reclassification tables, net reclassification improvement (NRI), and integrated discrimination improvement (IDI). Moreover, decision-analytic measures have been proposed, including decision curves to plot the net benefit achieved by making decisions based on model predictions. We aimed to define the role of these relatively novel approaches in the evaluation of the performance of prediction models. For illustration, we present a case study of predicting the presence of residual tumor versus benign tissue in patients with testicular cancer (n = 544 for model development, n = 273 for external validation). We suggest that reporting discrimination and calibration will always be important for a prediction model. Decision-analytic measures should be reported if the predictive model is to be used for clinical decisions. Other measures of performance may be warranted in specific applications, such as reclassification metrics to gain insight into the value of adding a novel predictor to an established model.
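For concreteness, the following is a minimal, illustrative sketch (not code from the paper) of two of the traditional measures named above, the Brier score and the concordance (c) statistic, computed for a binary outcome with NumPy. The outcome vector y and the predicted probabilities p are invented toy values.

import numpy as np

def brier_score(y, p):
    # Mean squared difference between predicted probability and observed 0/1 outcome.
    y, p = np.asarray(y, float), np.asarray(p, float)
    return np.mean((p - y) ** 2)

def c_statistic(y, p):
    # Probability that a randomly chosen event receives a higher predicted
    # probability than a randomly chosen non-event (ties count as 1/2);
    # identical to the area under the ROC curve.
    y, p = np.asarray(y), np.asarray(p, float)
    pos, neg = p[y == 1], p[y == 0]
    diffs = pos[:, None] - neg[None, :]   # every event/non-event pair
    return np.mean(diffs > 0) + 0.5 * np.mean(diffs == 0)

# Invented toy data (eight patients), not taken from the study:
y = np.array([0, 0, 1, 0, 1, 1, 0, 1])
p = np.array([0.10, 0.30, 0.70, 0.20, 0.80, 0.40, 0.45, 0.90])
print(f"Brier score: {brier_score(y, p):.3f}")   # 0 = perfect; about 0.25 for an uninformative model at 50% prevalence
print(f"c statistic: {c_statistic(y, p):.3f}")   # 0.5 = no discrimination, 1 = perfect discrimination

The pairwise formulation of the c statistic is written out only to make the definition explicit; standard ROC routines compute the same quantity more efficiently, and both measures apply unchanged to external validation data when p comes from a model fitted on the development set.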


Figures

Fig 1. Receiver operating characteristic (ROC) curves for the predicted probabilities without (solid line) and with the tumor marker LDH (dashed line) in the development data set (left) and for the predicted probabilities without the tumor marker LDH from the development data set in the validation data set (right). Threshold probabilities are indicated.
Fig 2. Box plots of predicted probabilities without and with the tumor marker LDH. The discrimination slope is calculated as the difference between the mean predicted probability with and without residual tumor (solid dots indicate means). The difference between discrimination slopes is equivalent to the integrated discrimination improvement (IDI = 0.04); a computational sketch follows the figure legends.
Fig 3. Scatter plot of predicted probabilities without and with the tumor marker LDH (+: tumor; o: necrosis). Some patients with necrosis have higher predicted risks of tumor according to the model without LDH than according to the model with LDH (circles in the lower right corner of the graph). For example, we note a patient with necrosis and an original prediction of nearly 60%, who is reclassified to a risk of less than 20%.
Fig 4. Decision curves for the predicted probabilities without (solid line) and with the tumor marker LDH (dashed line) in the development data set (left) and for the predicted probabilities without the tumor marker LDH from the development data set in the validation data set (right).
Fig 5. Validation plots of prediction models for residual masses in patients with testicular cancer without and with the tumor marker LDH. The arrow indicates the decision threshold of 20% risk of residual tumor.
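As a rough illustration of the measures shown in Figures 2 and 4, the sketch below computes the discrimination slope (and hence the IDI, as the difference in slopes between the models with and without LDH) and the net benefit at a 20% risk threshold. All data here are simulated placeholders; they are not the testicular cancer cohort, and the code is not taken from the paper.

import numpy as np

def discrimination_slope(y, p):
    # Mean predicted probability among events minus mean among non-events (Fig 2).
    y, p = np.asarray(y), np.asarray(p, float)
    return p[y == 1].mean() - p[y == 0].mean()

def net_benefit(y, p, threshold):
    # Net benefit of treating all patients with predicted risk >= threshold:
    # (true positives - false positives * threshold / (1 - threshold)) / n.
    y, p = np.asarray(y), np.asarray(p, float)
    treat = p >= threshold
    tp = np.sum(treat & (y == 1))
    fp = np.sum(treat & (y == 0))
    return (tp - fp * threshold / (1 - threshold)) / len(y)

# Hypothetical simulated data; NOT the study cohort.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)                                                 # 0/1 outcomes
p_base = np.clip(0.30 + 0.25 * y + rng.normal(0, 0.15, 200), 0.01, 0.99)    # stand-in for "without LDH"
p_new  = np.clip(0.30 + 0.30 * y + rng.normal(0, 0.15, 200), 0.01, 0.99)    # stand-in for "with LDH"

idi = discrimination_slope(y, p_new) - discrimination_slope(y, p_base)
print(f"IDI (difference in discrimination slopes): {idi:.3f}")
for label, p in (("base model", p_base), ("extended model", p_new)):
    print(f"Net benefit at 20% threshold, {label}: {net_benefit(y, p, 0.20):.3f}")

Evaluating net_benefit over a range of thresholds, alongside the "treat all" and "treat none" strategies, yields decision curves of the kind shown in Figure 4.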

