Measuring inter-rater reliability for nominal data - which coefficients and confidence intervals are appropriate?

Antonia Zapf et al. BMC Med Res Methodol. 2016 Aug 5;16:93. doi: 10.1186/s12874-016-0200-9.

Abstract

Background: Reliability of measurements is a prerequisite for medical research. For nominal data, Fleiss' kappa (in the following labelled Fleiss' K) and Krippendorff's alpha provide the highest flexibility among the available reliability measures with respect to the number of raters and categories. Our aim was to investigate which measures and which confidence intervals provide the best statistical properties for the assessment of inter-rater reliability in different situations.

Methods: We performed a large simulation study to investigate the precision of the estimates for Fleiss' K and Krippendorff's alpha and to determine the empirical coverage probability of the corresponding confidence intervals (asymptotic for Fleiss' K and bootstrap for both measures). Furthermore, we compared measures and confidence intervals in a real world case study.

Results: Point estimates of Fleiss' K and Krippendorff's alpha did not differ from each other in any scenario. In the case of missing data (completely at random), Krippendorff's alpha provided stable estimates, while the complete-case analysis approach for Fleiss' K led to biased estimates. For shifted null hypotheses, the coverage probability of the asymptotic confidence interval for Fleiss' K was low, while the bootstrap confidence intervals for both measures provided a coverage probability close to the theoretical one.

Conclusions: Fleiss' K and Krippendorff's alpha with bootstrap confidence intervals are equally suitable for the analysis of reliability of complete nominal data. The asymptotic confidence interval for Fleiss' K should not be used. In the case of missing data or data of higher than nominal order, Krippendorff's alpha is recommended. Together with this article, we provide an R-script for calculating Fleiss' K and Krippendorff's alpha and their corresponding bootstrap confidence intervals.
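The article's supplementary material is an R-script; as an illustration of the quantities discussed (not the authors' implementation), the following Python sketch computes Fleiss' K for complete nominal data and a percentile bootstrap confidence interval obtained by resampling subjects, which is the general approach the abstract recommends. The function names and the choice of 2000 bootstrap replicates are assumptions for this example.

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' K for an (N subjects x k categories) matrix of rating
    counts. Assumes complete data: every subject is rated by the same
    number of raters n (row sums are all equal to n)."""
    counts = np.asarray(counts, dtype=float)
    N = counts.shape[0]                 # number of subjects
    n = counts.sum(axis=1)[0]           # raters per subject
    p_j = counts.sum(axis=0) / (N * n)  # marginal category proportions
    # Per-subject observed agreement, then overall observed vs. chance
    P_i = (np.sum(counts ** 2, axis=1) - n) / (n * (n - 1))
    P_bar, P_e = P_i.mean(), np.sum(p_j ** 2)
    return (P_bar - P_e) / (1 - P_e)

def bootstrap_ci(counts, stat=fleiss_kappa, B=2000, level=0.95, seed=1):
    """Percentile bootstrap CI: resample subjects (rows) with
    replacement and take the empirical quantiles of the statistic."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)
    N = counts.shape[0]
    boot = [stat(counts[rng.integers(0, N, N)]) for _ in range(B)]
    a = (1 - level) / 2
    lo, hi = np.quantile(boot, [a, 1 - a])
    return lo, hi
```

On the standard textbook example (10 subjects, 14 raters, 5 categories), `fleiss_kappa` returns about 0.21; `bootstrap_ci` then gives an interval around that estimate without relying on the asymptotic variance, which the abstract reports to have poor coverage under shifted null hypotheses.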

Keywords: Bootstrap; Confidence interval; Fleiss’ K; Fleiss’ kappa; Inter-rater heterogeneity; Krippendorff’s alpha.


Figures

Fig. 1 Distribution of the true values in the 27 scenarios (independent of the sample size)

Fig. 2 Percentage bias for Krippendorff’s alpha and Fleiss’ K over all 81 scenarios. The dotted line indicates unbiasedness. On the left side the whole range from −100 to +100 % is displayed; on the right side the relevant excerpt is enlarged

Fig. 3 Two-sided empirical type-one error of the three approaches over all 81 scenarios. The dotted line indicates the theoretical coverage probability of 95 %

Fig. 4 Empirical coverage probability for the bootstrap intervals for Krippendorff’s alpha and Fleiss’ K with varying factors sample size (a), number of categories (b), number of raters (c), and strength of agreement (d). In each subplot, summary results over all levels of the other factors are displayed. The dashed line indicates the theoretical coverage probability of 95 %
