Risk Anal. 2022 Feb;42(2):264-278. doi: 10.1111/risa.13725. Epub 2021 Apr 16.

What is a Good Calibration Question?

Victoria Hemming et al. Risk Anal. 2022 Feb.

Abstract

Weighted aggregation of expert judgments based on their performance on calibration questions may improve mathematically aggregated judgments relative to equal weights. However, obtaining validated, relevant calibration questions can be difficult. If so, should analysts settle for equal weights? Or should they use calibration questions that are easier to obtain but less relevant? In this article, we examine what happens to the out-of-sample performance of weighted aggregations under the classical model (CM), compared to equally weighted aggregations, when the set of calibration questions includes many so-called "irrelevant" questions: questions that would ordinarily be considered outside the domain of the questions of interest. We find that performance-weighted aggregations outperform equal weights on the combined CM score, but not on statistical accuracy (i.e., calibration). Importantly, there was no appreciable difference in performance when weights were developed on relevant versus irrelevant questions. Experts were unable to adapt their knowledge across vastly different domains, and in-sample validation did not accurately predict out-of-sample performance on irrelevant questions. We suggest that if relevant calibration questions cannot be found, analysts should use equal weights and draw on alternative techniques to improve judgments. Our study also indicates limits to the predictive accuracy of performance-weighted aggregation, and to the degree to which expertise can be adapted across domains. We note limitations of our study and urge further research into the effect of question type on the reliability of performance-weighted aggregations.

Keywords: Aggregation; calibration; equal weights; expert judgment; performance weights.
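To make the mechanics concrete: the classical model (CM) scores each expert on calibration questions with known answers, multiplying a statistical-accuracy (calibration) score by an information score to produce the combined CM score, which is then normalized into performance weights. The Python sketch below illustrates that pipeline under simplifying assumptions; the experts, seed questions, and intrinsic ranges are invented for illustration, and the optimized significance-level cutoff used in the full CM is omitted.

    import numpy as np
    from scipy.stats import chi2

    # Mass expected to fall between the 5th/50th/95th percentile assessments.
    BIN_PROBS = np.array([0.05, 0.45, 0.45, 0.05])

    def calibration_score(quantile_sets, realizations):
        """Statistical accuracy: p-value of the asymptotic chi-square test that
        realizations fall into the inter-quantile bins with probs BIN_PROBS."""
        counts = np.zeros(len(BIN_PROBS))
        for q, x in zip(quantile_sets, realizations):
            counts[np.searchsorted(q, x)] += 1
        s = counts / len(realizations)
        mask = s > 0
        kl = np.sum(s[mask] * np.log(s[mask] / BIN_PROBS[mask]))  # I(s, p)
        return 1.0 - chi2.cdf(2 * len(realizations) * kl, df=len(BIN_PROBS) - 1)

    def information_score(quantile_sets, ranges):
        """Average relative information against a uniform background measure on
        each question's intrinsic range (a KL divergence, so always >= 0)."""
        per_item = []
        for q, (lo, hi) in zip(quantile_sets, ranges):
            widths = np.diff(np.concatenate(([lo], q, [hi]))) / (hi - lo)
            per_item.append(np.sum(BIN_PROBS * np.log(BIN_PROBS / widths)))
        return float(np.mean(per_item))

    # Toy data: three hypothetical experts give 5th/50th/95th percentiles for
    # ten calibration ("seed") questions with known realizations.
    rng = np.random.default_rng(1)
    realizations = rng.uniform(10, 90, size=10)
    experts = {
        "A": [np.sort(r + rng.normal(0, 5, 3)) for r in realizations],   # roughly calibrated
        "B": [np.sort(r + rng.normal(20, 2, 3)) for r in realizations],  # biased, overconfident
        "C": [np.sort(r + rng.normal(0, 40, 3)) for r in realizations],  # very diffuse
    }

    # Intrinsic range per question: span of all assessments plus the
    # realization, padded by a 10% overshoot so quantiles sit strictly inside.
    ranges = []
    for i, r in enumerate(realizations):
        vals = np.concatenate([[r]] + [experts[e][i] for e in experts])
        pad = 0.10 * (vals.max() - vals.min())
        ranges.append((vals.min() - pad, vals.max() + pad))

    combined = {e: calibration_score(qs, realizations) * information_score(qs, ranges)
                for e, qs in experts.items()}
    total = sum(combined.values())
    pw = {e: c / total for e, c in combined.items()}  # performance weights (no cutoff)
    ew = {e: 1.0 / len(experts) for e in experts}     # equal weights
    print("combined CM scores: ", combined)
    print("performance weights:", pw)
    print("equal weights:      ", ew)

In a full elicitation, these weights would then be applied to the experts' distributions on the questions of interest; the abstract's in-sample versus out-of-sample contrast asks whether weights fitted on one set of calibration questions continue to perform well on questions from a different domain.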

