Calibration of medical diagnostic classifier scores to the probability of disease

Weijie Chen et al. Stat Methods Med Res. 2018 May;27(5):1394-1409. doi: 10.1177/0962280216661371. Epub 2016 Aug 8.

Abstract

Scores produced by statistical classifiers in many clinical decision support systems and other medical diagnostic devices are generally on an arbitrary scale, so the clinical meaning of these scores is unclear. Calibration of classifier scores to a meaningful scale such as the probability of disease is potentially useful when such scores are used by a physician. In this work, we investigated three methods (parametric, semi-parametric, and non-parametric) for calibrating classifier scores to the probability of disease scale and developed uncertainty estimation techniques for these methods. We showed that classifier scores on arbitrary scales can be calibrated to the probability of disease scale without affecting their discrimination performance. With a finite dataset to train the calibration function, it is important to accompany the probability estimate with its confidence interval. Our simulations indicate that, when the dataset used to fit the calibration transformation is also used to estimate calibration performance, resubstitution bias arises in performance metrics that involve the truth states. However, the bias is small for the parametric and semi-parametric methods when the sample size is moderate to large (>100 per class).

Keywords: Calibration; classifier; probability of disease; rationality.
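
The article itself does not include code. As a rough illustration of the ideas in the abstract, the sketch below fits a parametric (logistic, Platt-type) and a non-parametric (isotonic) calibration to simulated two-class scores and attaches a percentile-bootstrap 95% confidence interval to the parametric curve. The binormal score model, the sample size of 300 per class, and the NumPy/scikit-learn calls are illustrative assumptions, not the authors' implementation.

    # Illustrative sketch only: parametric (logistic/Platt-type) and
    # non-parametric (isotonic) calibration of arbitrary-scale classifier
    # scores to an estimated probability of disease, with a percentile
    # bootstrap for the 95% CI of the parametric calibration curve.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.isotonic import IsotonicRegression

    rng = np.random.default_rng(0)

    # Simulated scores: non-diseased ~ N(0, 1), diseased ~ N(1, 1), 300 per class
    n = 300
    scores = np.concatenate([rng.normal(0, 1, n), rng.normal(1, 1, n)])
    truth = np.concatenate([np.zeros(n), np.ones(n)])

    def calibrate_parametric(s, y, s_new):
        # Platt-type calibration: logistic regression of truth state on score
        lr = LogisticRegression(C=1e6)  # very weak regularization ~ plain MLE
        lr.fit(s.reshape(-1, 1), y)
        return lr.predict_proba(s_new.reshape(-1, 1))[:, 1]

    def calibrate_isotonic(s, y, s_new):
        # Non-parametric calibration: monotone (isotonic) regression of truth on score
        iso = IsotonicRegression(out_of_bounds="clip")
        iso.fit(s, y)
        return iso.predict(s_new)

    # Point estimates of the probability of disease on a grid of scores
    grid = np.linspace(scores.min(), scores.max(), 50)
    p_param = calibrate_parametric(scores, truth, grid)
    p_iso = calibrate_isotonic(scores, truth, grid)

    # Percentile bootstrap 95% CI for the parametric calibration curve
    boot = []
    for _ in range(200):
        idx = rng.integers(0, len(scores), len(scores))
        boot.append(calibrate_parametric(scores[idx], truth[idx], grid))
    ci_lo, ci_hi = np.percentile(boot, [2.5, 97.5], axis=0)

Evaluating the fitted curve on the same data used to fit it is exactly the resubstitution setting the abstract warns about; an independent test set or cross-validation would be needed for an unbiased estimate of calibration performance.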


Conflict of interest statement

Declaration of conflicting interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Figures

Figure 1
Examples illustrating training of calibration functions: the parametric method gives a functional formula with parameters estimated from the training data; the semi- and non-parametric methods give an estimate of the probability of disease (pi) for each score (yi) in the training data.
Figure 2
An example of calibration. The two-class score data were generated from a pair of normal distributions with 300 samples per class. The left panel plots the true calibration function (dot-dash line), the estimated calibration function (solid line) and the associated 95% CI (dash line). The right panel plots the true versus estimated probability for the finite dataset.
Figure 3
An example of calibration. The two-class score data were generated from a pair of beta distributions with 300 samples per class. The left panel plots the true calibration function (dot-dash line), the estimated calibration function (solid line) and the associated 95% CI (dash line). The right panel plots the true versus estimated probability for the finite dataset.
Figure 4
The average width of the 95% CI as a function of sample size for the three methods, for the normal distribution data (left) and the beta distribution data (right), respectively.
Figure 5
Mean square error of calibrated probabilities with respect to the true probabilities for (a) normal distribution data, and (b) beta distribution data.
Figure 6
Brier score of calibrated probabilities for (a) normal distribution data, and (b) beta distribution data. The horizontal dot-dash line corresponds to the Brier score of perfectly calibrated scores (i.e. a calibration function trained with unlimited data).
Figure 7
Calibration of LDA classifier scores to the probability of being a nodule.
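
Figures 2, 3, 5, and 6 refer to a true calibration function, a mean square error against the true probabilities, and a Brier score. For simulated data these quantities follow from Bayes' rule and the standard definitions; the sketch below evaluates them for a binormal score model. The distribution parameters and equal prevalence are illustrative assumptions, not the paper's exact simulation settings.

    # Sketch: the "true" calibration function for simulated binormal scores,
    # plus the two evaluation metrics shown in Figures 5 and 6. Parameters
    # are illustrative assumptions only.
    import numpy as np
    from scipy.stats import norm

    def true_prob_disease(y, mu0=0.0, mu1=1.0, sd0=1.0, sd1=1.0, prev=0.5):
        # P(disease | score y) from Bayes' rule when non-diseased scores are
        # N(mu0, sd0), diseased scores are N(mu1, sd1), and prevalence is `prev`.
        f1 = prev * norm.pdf(y, mu1, sd1)
        f0 = (1.0 - prev) * norm.pdf(y, mu0, sd0)
        return f1 / (f1 + f0)

    def mse_vs_true(p_hat, p_true):
        # Mean square error of calibrated probabilities against the true
        # probabilities (as in Figure 5).
        return np.mean((np.asarray(p_hat) - np.asarray(p_true)) ** 2)

    def brier_score(p_hat, truth):
        # Brier score: mean squared difference between calibrated probabilities
        # and the 0/1 truth states (as in Figure 6).
        return np.mean((np.asarray(p_hat) - np.asarray(truth)) ** 2)

    # Example: true probability of disease on a grid of scores
    grid = np.linspace(-3.0, 4.0, 8)
    print(np.round(true_prob_disease(grid), 3))

In this sketch, the reference level in Figure 6 (the Brier score of perfectly calibrated scores) would correspond to plugging the true probabilities themselves into brier_score.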
