Calibrating random forests for probability estimation

Theresa Dankowski¹, Andreas Ziegler¹,²,³,⁴

Affiliations

¹ Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Lübeck, Germany.
² Zentrum für Klinische Studien, Universität zu Lübeck, Lübeck, Germany.
³ DZHK (German Centre for Cardiovascular Research), Hamburg/Kiel/Lübeck Partner Site, Lübeck, Germany.
⁴ School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, Pietermaritzburg, South Africa.

PMID: 27074747
PMCID: PMC5074325
DOI: 10.1002/sim.6959

Theresa Dankowski et al. Stat Med. 2016 Sep 30;35(22):3949-60. doi: 10.1002/sim.6959. Epub 2016 Apr 13.

Abstract

Probabilities can be consistently estimated using random forests. It is, however, unclear how random forests should be updated to make predictions for other centers or at different time points. In this work, we present two approaches for updating random forests for probability estimation. The first method has been proposed by Elkan and may be used for updating any machine learning approach yielding consistent probabilities, so-called probability machines. The second approach is a new strategy specifically developed for random forests. Using the terminal nodes, which represent conditional probabilities, the random forest is first translated to logistic regression models. These are, in turn, used for re-calibration. The two updating strategies were compared in a simulation study and are illustrated with data from the German Stroke Study Collaboration. In most simulation scenarios, both methods led to similar improvements. In the simulation scenario in which the stricter assumptions of Elkan's method were not met, the logistic regression-based re-calibration approach for random forests outperformed Elkan's method. It also performed better on the stroke data than Elkan's method. The strength of Elkan's method is its general applicability to any probability machine. However, if the strict assumptions underlying this approach are not met, the logistic regression-based approach is preferable for updating random forests for probability estimation. © 2016 The Authors. Statistics in Medicine Published by John Wiley & Sons Ltd.

Keywords: calibration; logistic regression; probability estimation; probability machine; random forests; updating.
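
The abstract does not spell out Elkan's updating rule. As a minimal sketch, assuming it refers to the base-rate adjustment commonly attributed to Elkan (rescaling the predicted odds by the ratio of new to old outcome odds), the update could look like the following. The function name `elkan_adjust` and its parameters are hypothetical and chosen here for illustration only; the rule assumes that only the outcome prevalence differs between the model-building and the new setting.

```python
def elkan_adjust(p, base_rate_old, base_rate_new):
    """Shift a predicted probability p from an old to a new outcome base rate.

    Assumption (not stated in the abstract): the update is
        p' = b' p (1 - b) / (b' p (1 - b) + b (1 - p) (1 - b')),
    where b is the old and b' the new base rate. This is equivalent to
    multiplying the odds of p by the ratio of new to old base-rate odds.
    """
    if not (0.0 < base_rate_old < 1.0 and 0.0 < base_rate_new < 1.0):
        raise ValueError("base rates must lie strictly between 0 and 1")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    numerator = base_rate_new * p * (1.0 - base_rate_old)
    denominator = numerator + base_rate_old * (1.0 - p) * (1.0 - base_rate_new)
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical values: predictions from a model trained at 50% prevalence,
    # applied in a setting with 20% prevalence.
    for p in (0.1, 0.5, 0.9):
        print(p, round(elkan_adjust(p, 0.5, 0.2), 4))
```

When the old and new base rates coincide, the function returns the input probability unchanged, which is a quick sanity check on the formula.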

Figures

**Figure 1**
Example of a probability estimation tree for a dichotomous outcome y given covariates x₁ and x₂. Split points are c₁ and c₂, and the terminal nodes are labeled t₁, t₂, and t₃.

**Figure 2**
Steps in logistic regression‐based updating approach for random forests.

**Figure 3**
True versus predicted probabilities for simulation scenario 1. RF: random forest built on model‐building data; RF + LogReg: RF translated to logistic regression models; RF + LogReg + Cal: RF + LogReg updated using re‐calibration; RF + CalElkan: probabilities from RF updated using Elkan's method; RF onCalData: random forest built on cal‐training data; LogReg + Cal: logistic regression fitted using model‐building data and updated using re‐calibration.

**Figure 4**
Mean‐squared errors between true and predicted probabilities in the simulation study. RF: random forest built on model‐building data; RF + LogReg: RF translated to logistic regression models; RF + LogReg + Cal: RF + LogReg updated using re‐calibration; RF + CalElkan: probabilities from RF updated using Elkan's method; RF onCalData: random forest built on cal‐training data; LogReg + Cal: logistic regression fitted using model‐building data and updated using re‐calibration.

**Figure 5**
Mean‐squared errors between true and predicted probabilities in simulation scenario 5, where the calibration data for training were substantially smaller than in the other simulation scenarios. RF: random forest built on model‐building data; RF + LogReg: RF translated to logistic regression models; RF + LogReg + Cal: RF + LogReg updated using re‐calibration; RF + CalElkan: probabilities from RF updated using Elkan's method; RF onCalData: random forest built on cal‐training data; LogReg + Cal: logistic regression fitted using model‐building data and updated using re‐calibration.

**Figure 6**
True versus predicted probabilities for simulation scenario 6. RF: random forest built on model‐building data; RF + LogReg: RF translated to logistic regression models; RF + LogReg + Cal: RF + LogReg updated using re‐calibration; RF + CalElkan: probabilities from RF updated using Elkan's method; RF onCalData: random forest built on cal‐training data; LogReg + Cal: logistic regression fitted using model‐building data and updated using re‐calibration.

**Figure 7**
Calibration curves for external stroke validation data. RF: random forest built on training data; RF + CalElkan: probabilities from RF updated using Elkan's method; RF + LogReg + Cal: RF translated to logistic regression models and updated using re‐calibration.
