Toxicol Pathol. 2024 Jul;52(5):258-265.
doi: 10.1177/01926233241259998. Epub 2024 Jun 22.

Inter-Rater and Intra-Rater Agreement in Scoring Severity of Rodent Cardiomyopathy and Relation to Artificial Intelligence-Based Scoring

Thomas J Steinbach et al. Toxicol Pathol. 2024 Jul.

Abstract

We previously developed a computer-assisted image analysis algorithm to detect and quantify the microscopic features of rodent progressive cardiomyopathy (PCM) in rat heart histologic sections and validated the results with a panel of five veterinary toxicologic pathologists using a multinomial logistic model. In this study, we assessed both the inter-rater and intra-rater agreement of the pathologists and compared pathologists' ratings to the artificial intelligence (AI)-predicted scores. Pathologists and the AI algorithm were presented with 500 slides of rodent heart and quantified the amount of cardiomyopathy in each slide. A total of 200 of these slides were novel to this study, whereas 100 slides were intentionally selected for repetition from the previous study. After a washout period of more than six months, the repeated slides were examined to assess intra-rater agreement among pathologists. We found the intra-rater agreement to be substantial, with weighted Cohen's kappa values ranging from κ = 0.64 to 0.80; intra-rater variability is not a concern for the deterministic AI algorithm. The inter-rater agreement across pathologists was moderate (Cohen's kappa κ = 0.56). These results demonstrate the utility of AI algorithms as a tool for pathologists to increase sensitivity and specificity for the histopathologic assessment of the heart in toxicology studies.
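As an illustration of the agreement statistics reported above, the sketch below computes unweighted and linearly weighted Cohen's kappa for two raters using scikit-learn. The rater arrays and grade values are invented for demonstration only and are not data from this study.

# Minimal sketch (hypothetical data): unweighted vs. linearly weighted
# Cohen's kappa for two raters assigning ordinal severity grades (0-5).
from sklearn.metrics import cohen_kappa_score

# Invented example grades for the same 10 slides from two raters.
rater_1 = [0, 1, 1, 2, 3, 3, 4, 4, 5, 2]
rater_2 = [0, 1, 2, 2, 3, 4, 4, 5, 5, 1]

# Unweighted kappa treats every disagreement equally.
kappa_unweighted = cohen_kappa_score(rater_1, rater_2)

# Linearly weighted kappa penalizes disagreements in proportion to how far
# apart the grades are, which suits ordinal severity scores.
kappa_weighted = cohen_kappa_score(rater_1, rater_2, weights="linear")

print(f"unweighted kappa: {kappa_unweighted:.2f}")
print(f"weighted kappa:   {kappa_weighted:.2f}")

Weighted kappa is the natural choice for ordinal severity grades because a one-grade disagreement is penalized less than a disagreement several grades apart.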

Keywords: Sprague Dawley; artificial intelligence; cardiomyopathy; computer-assisted image analysis; deep learning; inter-rater agreement; intra-rater agreement; kappa; rat.


Conflict of interest statement

Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Figures

Figure 1.
Distribution of intra-rater reliability for the five veterinary pathologists using data from the 100 repeated slides. The points represent the level of agreement for each pathologist using accuracy (A) and Cohen's kappa (B). In each panel, values are shown for strict agreement (Exact Grade in panel A, Unweighted Cohen's kappa in panel B) and a tolerance for some margin of disagreement (within ±1 grade in panel A, weighted Cohen's kappa in panel B).
Figure 2.
Intra-rater agreement for the five veterinary pathologists by quintiles of AIA score using data from the 100 repeated slides. Values for each rater (A, B, C, D, E) are shown as percent agreement within each quintile of AIA-predicted score, where the upper bound of each quintile interval is shown on the horizontal axis. Mean values are shown as red plus signs.
Figure 3.
Distribution of pairwise inter-rater reliability measures. The points represent the level of agreement between each of the ten rater pairs using percent agreement (A) and Cohen’s kappa (B). In each panel, values are shown for strict agreement (Exact Grade in panel A, Unweighted Cohen’s kappa in panel B) and a tolerance for some margin of disagreement (within ±1 grade in panel A, weighted Cohen’s kappa in panel B). Boxplots are overlaid to show the distribution of the data.
Figure 4.
Distribution of percent agreement across all pairs of raters. Horizontal lines represent the percent agreement between each of the ten rater pairs by deciles of AIA score (A) and by median grade severity (B). Mean values are shown as red plus signs. In panel A, the upper bound of each decile interval is shown on the horizontal axis. In panel B, the median grade across all five raters is shown on the horizontal axis, with grades 4 and 5 combined.
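The captions above summarize exact-grade and within ±1 grade percent agreement over all ten rater pairs. The sketch below shows one way such pairwise agreement could be tabulated; the grade matrix is invented for illustration and is not data from this study.

# Minimal sketch (hypothetical data): exact and within-±1-grade percent
# agreement for every pair of raters, as summarized in the figure captions.
from itertools import combinations
import numpy as np

# Invented severity grades: rows = raters (A-E), columns = slides.
grades = np.array([
    [0, 1, 2, 3, 4, 5, 2, 1],   # rater A
    [0, 1, 2, 2, 4, 4, 3, 1],   # rater B
    [1, 1, 2, 3, 5, 5, 2, 0],   # rater C
    [0, 2, 2, 3, 4, 5, 2, 1],   # rater D
    [0, 1, 3, 3, 4, 5, 1, 1],   # rater E
])
raters = "ABCDE"

# Five raters yield C(5, 2) = 10 rater pairs.
for i, j in combinations(range(len(raters)), 2):
    diff = np.abs(grades[i] - grades[j])
    exact = np.mean(diff == 0) * 100       # exact-grade agreement (%)
    within_one = np.mean(diff <= 1) * 100  # within ±1 grade agreement (%)
    print(f"{raters[i]}-{raters[j]}: exact {exact:.0f}%, ±1 grade {within_one:.0f}%")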
