Mod Pathol. 2026 Feb;39(2):100944.
doi: 10.1016/j.modpat.2025.100944. Epub 2026 Jan 5.

Agreement Across 10 Artificial Intelligence Models in Assessing Human Epidermal Growth Factor Receptor 2 (HER2) Expression in Breast Cancer Whole-Slide Images

Brittany McKelvey et al. Mod Pathol. 2026 Feb.
Free article

Abstract

Historically, eligibility for receiving human epidermal growth factor receptor 2 (HER2)-targeted therapies was limited to HER2-positive tumors (immunohistochemistry 3+ or in situ hybridization amplified), but recent advances in antibody-drug conjugates have expanded these criteria to include HER2-low and HER2-ultralow expression. This evolving therapeutic landscape underscores the need for precise and reproducible HER2 assessment. Digital and computational pathology tools may help address these needs, but their measurement variability must be evaluated to inform research and clinical use. We evaluated HER2 scoring variability across 10 independently developed computational pathology artificial intelligence models applied to 1124 whole-slide images from 733 patients with breast cancer. Analyses included American Society of Clinical Oncology-College of American Pathologists categorical scores (0, 1+, 2+, and 3+), H-scores, tumor cell staining percentages, and counts of total and stained invasive carcinoma cells. Agreement among models and 3 pathologists was assessed using pairwise overall percent agreement (OPA), Cohen kappa, and hierarchical clustering. Median model pairwise OPA for categorical HER2 scores was 65.1% (kappa, 0.51). Agreement was highest for HER2 3+ vs not 3+ (OPA, 97.3%; kappa, 0.86) and lowest for HER2-low cases, reflecting existing measurement challenges. For HER2 0 (negative) vs not 0 (positive) scoring, the average negative agreement was 65.3%, compared with the average positive agreement of 91.3%, suggesting that models agreed more often on non-HER2 0 scores than on HER2 0 scores. H-score and cell count analyses indicated that scoring differences were more related to staining interpretation than tumor cell detection. Pathologists showed numerically higher concordance than models, but interobserver variability persisted. In exploratory analyses, sample type, staining artifacts, and heterogeneous HER2 expression appeared to be associated with discrepancies.
Artificial intelligence-based HER2 scoring demonstrated high agreement in identifying HER2 3+ cases. Variability was most pronounced in borderline HER2 categories, particularly in HER2 low, underscoring the need for continued tool refinement for handling low-intensity staining. Standardized training data sets, validation frameworks, and regulatory alignment are important to improve reproducibility. Developing reference standards and benchmarking data sets is critical to evaluate performance, support regulatory decision-making, and ensure real-world applicability.

Keywords: HER2 scoring; artificial intelligence; breast cancer; computational pathology; whole-slide imaging.
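
The agreement statistics named in the abstract (pairwise OPA, Cohen kappa, and average positive and negative agreement) can be illustrated with a minimal Python sketch. This is not the authors' code: the abstract does not give their exact formulas, so the sketch assumes the standard unweighted Cohen kappa and the commonly used definitions of average positive and negative agreement. The function names and the toy scores for `model_a`, `model_b`, and `model_c` are hypothetical and are not data from the study.

```python
from collections import Counter
from itertools import combinations
from statistics import median


def overall_percent_agreement(a, b):
    """Fraction of cases on which two raters assign the same category."""
    if len(a) != len(b) or not a:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Unweighted Cohen kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = overall_percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement from the product of each rater's marginal frequencies.
    categories = counts_a.keys() | counts_b.keys()
    p_e = sum(counts_a[k] * counts_b[k] for k in categories) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical category
    return (p_o - p_e) / (1 - p_e)


def positive_negative_agreement(a, b, negative="0"):
    """Average positive and negative agreement for a binary split.

    Assumed definitions (not confirmed by the abstract):
      APA = 2 * both_pos / (2 * both_pos + discordant)
      ANA = 2 * both_neg / (2 * both_neg + discordant)
    Returns None for a value whose denominator is zero.
    """
    both_pos = sum(x != negative and y != negative for x, y in zip(a, b))
    both_neg = sum(x == negative and y == negative for x, y in zip(a, b))
    discordant = len(a) - both_pos - both_neg
    pos_den = 2 * both_pos + discordant
    neg_den = 2 * both_neg + discordant
    apa = 2 * both_pos / pos_den if pos_den else None
    ana = 2 * both_neg / neg_den if neg_den else None
    return apa, ana


def median_pairwise_opa(scores_by_model):
    """Median OPA over all unordered pairs of models."""
    pairs = combinations(scores_by_model.values(), 2)
    return median(overall_percent_agreement(a, b) for a, b in pairs)


# Hypothetical categorical HER2 scores for six slides (illustration only).
model_a = ["0", "1+", "2+", "3+", "0", "1+"]
model_b = ["0", "0", "2+", "3+", "1+", "1+"]
model_c = ["0", "1+", "2+", "3+", "0", "2+"]

opa_ab = overall_percent_agreement(model_a, model_b)
kappa_ab = cohen_kappa(model_a, model_b)
apa_ab, ana_ab = positive_negative_agreement(model_a, model_b, negative="0")
median_opa = median_pairwise_opa({"a": model_a, "b": model_b, "c": model_c})
```

Splitting agreement into positive and negative components shows which side of a binary cutoff drives disagreement, which is how the abstract contrasts HER2 0 with non-0 scores. A single OPA value would hide that asymmetry.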
