Statistical testing of agreement in overlap-based performance between an AI segmentation device and a multi-expert human panel without requiring a reference standard
- PMID: 41132782
- PMCID: PMC12543030
- DOI: 10.1117/1.JMI.12.5.055003
Abstract
Purpose: Artificial intelligence (AI)-based medical imaging devices often include lesion or organ segmentation capabilities. Existing methods for segmentation performance evaluation compare AI results with an aggregated reference standard using accuracy metrics such as the Dice coefficient or Hausdorff distance. However, these approaches are limited by lacking a gold standard and challenges in defining meaningful success criteria. To address this, we developed a statistical method to assess agreement between an AI device and multiple human experts without requiring a reference standard.
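The Dice coefficient mentioned above measures the overlap between two binary segmentation masks as twice the intersection divided by the sum of the mask sizes. A minimal sketch (not from the paper; the mask shapes and values are illustrative assumptions):

```python
import numpy as np

def dice(a: np.ndarray, b: np.ndarray) -> float:
    """Dice similarity coefficient between two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    # Convention: two empty masks are treated as perfectly overlapping.
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

# Toy 4x4 masks: a 2x2 square vs. a 2x3 rectangle sharing 4 pixels
a = np.zeros((4, 4), dtype=int); a[1:3, 1:3] = 1   # 4 pixels
b = np.zeros((4, 4), dtype=int); b[1:3, 1:4] = 1   # 6 pixels
print(dice(a, b))  # 2*4 / (4+6) = 0.8
```

A Dice value of 1 indicates identical masks and 0 indicates no overlap; the corresponding dissimilarity 1 - Dice is what pairwise comparisons below operate on.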
Approach: We propose a paired-testing method to evaluate whether an AI device's segmentation performance significantly differs from that of multiple human experts. The method compares device-to-expert dissimilarity with expert-to-expert dissimilarity, avoiding the need for a reference standard. We validated the method through (1) statistical simulations where the Dice coefficient performance is either shared ("overlap agreeable") or not shared ("overlap disagreeable") between the device and experts; (2) image-based simulations using 2D contours with shared or nonshared transformation parameters (transformation agreeable or disagreeable). We also applied the method to compare an AI segmentation algorithm with four radiologists using data from the Lung Image Database Consortium.
Results: Statistical simulations show that the method controls the type I error for overlap-agreeable scenarios and the type II error for overlap-disagreeable scenarios. Image-based simulations show acceptable performance, with a mean type I error of 0.07 (SD 0.03) for transformation-agreeable cases and a mean type II error of 0.07 (SD 0.18) for transformation-disagreeable cases.
Conclusions: The paired-testing method offers a new tool for assessing the agreement between an AI segmentation device and multiple human expert panelists without requiring a reference standard.
Keywords: multi-expert human panel; paired testing; segmentation assessment.
Published by SPIE.
