Statistical testing of agreement in overlap-based performance between an AI segmentation device and a multi-expert human panel without requiring a reference standard
- PMID: 41132782
- PMCID: PMC12543030
- DOI: 10.1117/1.JMI.12.5.055003
Abstract
Purpose: Artificial intelligence (AI)-based medical imaging devices often include lesion or organ segmentation capabilities. Existing methods for segmentation performance evaluation compare AI results with an aggregated reference standard using accuracy metrics such as the Dice coefficient or Hausdorff distance. However, these approaches are limited by lacking a gold standard and challenges in defining meaningful success criteria. To address this, we developed a statistical method to assess agreement between an AI device and multiple human experts without requiring a reference standard.
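The Dice coefficient mentioned above measures the overlap between two binary segmentation masks as twice the intersection divided by the sum of the mask sizes. A minimal sketch (not from the paper; the mask shapes and values are illustrative assumptions):

```python
import numpy as np

def dice(a: np.ndarray, b: np.ndarray) -> float:
    """Dice similarity coefficient between two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    # Convention: two empty masks are treated as perfectly overlapping.
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

# Toy 4x4 masks: a 2x2 square vs. a 2x3 rectangle sharing 4 pixels
a = np.zeros((4, 4), dtype=int); a[1:3, 1:3] = 1   # 4 pixels
b = np.zeros((4, 4), dtype=int); b[1:3, 1:4] = 1   # 6 pixels
print(dice(a, b))  # 2*4 / (4+6) = 0.8
```

A Dice value of 1 indicates identical masks and 0 indicates no overlap; the corresponding dissimilarity 1 - Dice is what pairwise comparisons below operate on.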
Approach: We propose a paired-testing method to evaluate whether an AI device's segmentation performance significantly differs from that of multiple human experts. The method compares device-to-expert dissimilarity with expert-to-expert dissimilarity, avoiding the need for a reference standard. We validated the method through (1) statistical simulations where the Dice coefficient performance is either shared ("overlap agreeable") or not shared ("overlap disagreeable") between the device and experts; (2) image-based simulations using 2D contours with shared or nonshared transformation parameters (transformation agreeable or disagreeable). We also applied the method to compare an AI segmentation algorithm with four radiologists using data from the Lung Image Database Consortium.
Results: Statistical simulations show that the method controls the type I error for overlap-agreeable scenarios and the type II error for overlap-disagreeable scenarios. Image-based simulations show acceptable performance, with a mean type I error of 0.07 (SD 0.03) for transformation-agreeable cases and a mean type II error of 0.07 (SD 0.18) for transformation-disagreeable cases.
Conclusions: The paired-testing method offers a new tool for assessing the agreement between an AI segmentation device and multiple human expert panelists without requiring a reference standard.
Keywords: multi-expert human panel; paired testing; segmentation assessment.
Published by SPIE.
