J Med Imaging (Bellingham). 2025 Sep;12(5):055003.
doi: 10.1117/1.JMI.12.5.055003. Epub 2025 Oct 22.

Statistical testing of agreement in overlap-based performance between an AI segmentation device and a multi-expert human panel without requiring a reference standard

Tingting Hu et al. J Med Imaging (Bellingham). 2025 Sep.

Abstract

Purpose: Artificial intelligence (AI)-based medical imaging devices often include lesion or organ segmentation capabilities. Existing methods for segmentation performance evaluation compare AI results with an aggregated reference standard using accuracy metrics such as the Dice coefficient or Hausdorff distance. However, these approaches are limited by the lack of a gold standard and by the challenge of defining meaningful success criteria. To address this, we developed a statistical method to assess agreement between an AI device and multiple human experts without requiring a reference standard.
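The Dice coefficient referenced above is a standard overlap metric, 2|A∩B| / (|A| + |B|), between two binary segmentation masks. A minimal NumPy sketch (the function name and toy masks are illustrative, not from the paper):

```python
import numpy as np

def dice_coefficient(a: np.ndarray, b: np.ndarray) -> float:
    """Dice similarity between two binary masks: 2|A∩B| / (|A| + |B|)."""
    a = a.astype(bool)
    b = b.astype(bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * np.logical_and(a, b).sum() / denom

# Two toy 4x4 masks: 4 foreground pixels each, 2 in common
mask_a = np.zeros((4, 4), dtype=bool); mask_a[1:3, 1:3] = True
mask_b = np.zeros((4, 4), dtype=bool); mask_b[1:3, 2:4] = True
print(dice_coefficient(mask_a, mask_b))  # 2*2 / (4+4) = 0.5
```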

Approach: We propose a paired-testing method to evaluate whether an AI device's segmentation performance significantly differs from that of multiple human experts. The method compares device-to-expert dissimilarity with expert-to-expert dissimilarity, avoiding the need for a reference standard. We validated the method through (1) statistical simulations where the Dice coefficient performance is either shared ("overlap agreeable") or not shared ("overlap disagreeable") between the device and experts; (2) image-based simulations using 2D contours with shared or nonshared transformation parameters (transformation agreeable or disagreeable). We also applied the method to compare an AI segmentation algorithm with four radiologists using data from the Lung Image Database Consortium.
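The core comparison can be illustrated with a toy sketch: on each case, take the difference between the device-to-expert dissimilarity (1 − Dice) and the expert-to-expert dissimilarity, then test whether the paired differences center on zero. The per-case difference values and the sign-flip permutation test below are illustrative assumptions, not the paper's exact test statistic:

```python
import numpy as np

def paired_permutation_test(d, n_perm: int = 10_000, seed: int = 0) -> float:
    """Sign-flip permutation test of H0: mean paired difference == 0.

    Under H0 each difference is symmetric about zero, so randomly
    flipping signs generates the null distribution of |mean(d)|.
    """
    rng = np.random.default_rng(seed)
    d = np.asarray(d, dtype=float)
    observed = abs(d.mean())
    signs = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    null = np.abs((signs * d).mean(axis=1))
    # add-one correction keeps the p-value strictly positive
    return (1 + np.sum(null >= observed)) / (1 + n_perm)

# d[i] = (mean device-to-expert dissimilarity) minus
#        (mean expert-to-expert dissimilarity) on case i; toy values:
d = np.array([0.01, -0.02, 0.00, 0.03, -0.01, 0.02, -0.03, 0.01])
p_value = paired_permutation_test(d)
```

A small p-value would indicate the device's overlap with the experts differs systematically from the experts' overlap with one another; a large one is consistent with the device behaving like another panelist.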

Results: Statistical simulations show the method controls the type I error (≤0.05) for overlap-agreeable scenarios and the type II error (≈0) for overlap-disagreeable scenarios. Image-based simulations show acceptable performance, with a mean type I error of 0.07 (SD 0.03) for transformation-agreeable cases and a mean type II error of 0.07 (SD 0.18) for transformation-disagreeable cases.

Conclusions: The paired-testing method offers a new tool for assessing the agreement between an AI segmentation device and multiple human expert panelists without requiring a reference standard.

Keywords: multi-expert human panel; paired testing; segmentation assessment.
