Calibration of medical diagnostic classifier scores to the probability of disease

Weijie Chen et al. Stat Methods Med Res. 2018 May;27(5):1394-1409. doi: 10.1177/0962280216661371. Epub 2016 Aug 8.

Abstract

Scores produced by statistical classifiers in many clinical decision support systems and other medical diagnostic devices are generally on an arbitrary scale, so the clinical meaning of these scores is unclear. Calibration of classifier scores to a meaningful scale such as the probability of disease is potentially useful when such scores are used by a physician. In this work, we investigated three methods (parametric, semi-parametric, and non-parametric) for calibrating classifier scores to the probability of disease scale and developed uncertainty estimation techniques for these methods. We showed that classifier scores on arbitrary scales can be calibrated to the probability of disease scale without affecting their discrimination performance. With a finite dataset to train the calibration function, it is important to accompany the probability estimate with its confidence interval. Our simulations indicate that, when the dataset used to fit the calibration transformation is also used to estimate calibration performance, resubstitution bias arises in performance metrics that involve the truth states. However, the bias is small for the parametric and semi-parametric methods when the sample size is moderate to large (>100 per class).

Keywords: Calibration; classifier; probability of disease; rationality.
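
The article itself does not include code. As a rough illustration of the ideas in the abstract, the sketch below fits a parametric (logistic, Platt-type) and a non-parametric (isotonic) calibration to simulated two-class scores and attaches a percentile-bootstrap 95% confidence interval to the parametric curve. The binormal score model, the sample size of 300 per class, and the NumPy/scikit-learn calls are illustrative assumptions, not the authors' implementation.

    # Illustrative sketch only: parametric (logistic/Platt-type) and
    # non-parametric (isotonic) calibration of arbitrary-scale classifier
    # scores to an estimated probability of disease, with a percentile
    # bootstrap for the 95% CI of the parametric calibration curve.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.isotonic import IsotonicRegression

    rng = np.random.default_rng(0)

    # Simulated scores: non-diseased ~ N(0, 1), diseased ~ N(1, 1), 300 per class
    n = 300
    scores = np.concatenate([rng.normal(0, 1, n), rng.normal(1, 1, n)])
    truth = np.concatenate([np.zeros(n), np.ones(n)])

    def calibrate_parametric(s, y, s_new):
        # Platt-type calibration: logistic regression of truth state on score
        lr = LogisticRegression(C=1e6)  # very weak regularization ~ plain MLE
        lr.fit(s.reshape(-1, 1), y)
        return lr.predict_proba(s_new.reshape(-1, 1))[:, 1]

    def calibrate_isotonic(s, y, s_new):
        # Non-parametric calibration: monotone (isotonic) regression of truth on score
        iso = IsotonicRegression(out_of_bounds="clip")
        iso.fit(s, y)
        return iso.predict(s_new)

    # Point estimates of the probability of disease on a grid of scores
    grid = np.linspace(scores.min(), scores.max(), 50)
    p_param = calibrate_parametric(scores, truth, grid)
    p_iso = calibrate_isotonic(scores, truth, grid)

    # Percentile bootstrap 95% CI for the parametric calibration curve
    boot = []
    for _ in range(200):
        idx = rng.integers(0, len(scores), len(scores))
        boot.append(calibrate_parametric(scores[idx], truth[idx], grid))
    ci_lo, ci_hi = np.percentile(boot, [2.5, 97.5], axis=0)

Evaluating the fitted curve on the same data used to fit it is exactly the resubstitution setting the abstract warns about; an independent test set or cross-validation would be needed for an unbiased estimate of calibration performance.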


Conflict of interest statement

Declaration of conflicting interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Figures

Figure 1
Examples illustrating training of calibration functions: the parametric method gives a functional formula with parameters estimated from the training data; the semi- and non-parametric methods give an estimate of the probability of disease (pi) for each score (yi) in the training data.
Figure 2
An example of calibration. The two-class score data were generated from a pair of normal distributions with 300 samples per class. The left panel plots the true calibration function (dot-dash line), the estimated calibration function (solid line) and the associated 95% CI (dash line). The right panel plots the true versus estimated probability for the finite dataset.
Figure 3
An example of calibration. The two-class score data were generated from a pair of beta distributions with 300 samples per class. The left panel plots the true calibration function (dot-dash line), the estimated calibration function (solid line) and the associated 95% CI (dash line). The right panel plots the true versus estimated probability for the finite dataset.
Figure 4
The average width of the 95% CI as a function of sample size for the three methods, for the normal distribution data (left) and the beta distribution data (right), respectively.
Figure 5
Mean square error of calibrated probabilities with respect to the true probabilities for (a) normal distribution data, and (b) beta distribution data.
Figure 6
Brier score of calibrated probabilities for (a) normal distribution data, and (b) beta distribution data. The horizontal dot-dash line corresponds to the Brier score of perfectly calibrated scores (i.e. a calibration function trained with unlimited data).
Figure 7
Calibration of LDA classifier scores to the probability of being a nodule.
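
Figures 2, 3, 5, and 6 refer to a true calibration function, a mean square error against the true probabilities, and a Brier score. For simulated data these quantities follow from Bayes' rule and the standard definitions; the sketch below evaluates them for a binormal score model. The distribution parameters and equal prevalence are illustrative assumptions, not the paper's exact simulation settings.

    # Sketch: the "true" calibration function for simulated binormal scores,
    # plus the two evaluation metrics shown in Figures 5 and 6. Parameters
    # are illustrative assumptions only.
    import numpy as np
    from scipy.stats import norm

    def true_prob_disease(y, mu0=0.0, mu1=1.0, sd0=1.0, sd1=1.0, prev=0.5):
        # P(disease | score y) from Bayes' rule when non-diseased scores are
        # N(mu0, sd0), diseased scores are N(mu1, sd1), and prevalence is `prev`.
        f1 = prev * norm.pdf(y, mu1, sd1)
        f0 = (1.0 - prev) * norm.pdf(y, mu0, sd0)
        return f1 / (f1 + f0)

    def mse_vs_true(p_hat, p_true):
        # Mean square error of calibrated probabilities against the true
        # probabilities (as in Figure 5).
        return np.mean((np.asarray(p_hat) - np.asarray(p_true)) ** 2)

    def brier_score(p_hat, truth):
        # Brier score: mean squared difference between calibrated probabilities
        # and the 0/1 truth states (as in Figure 6).
        return np.mean((np.asarray(p_hat) - np.asarray(truth)) ** 2)

    # Example: true probability of disease on a grid of scores
    grid = np.linspace(-3.0, 4.0, 8)
    print(np.round(true_prob_disease(grid), 3))

In this sketch, the reference level in Figure 6 (the Brier score of perfectly calibrated scores) would correspond to plugging the true probabilities themselves into brier_score.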
