Comparative Study

Comparison of AI-integrated pathways with human-AI interaction in population mammographic screening for breast cancer

Helen M L Frazer et al. Nat Commun. 2024 Aug 30;15(1):7525. doi: 10.1038/s41467-024-51725-8

Abstract

Artificial intelligence (AI) readers of mammograms compare favourably to individual radiologists in detecting breast cancer. However, AI readers cannot perform at the level of the multi-reader systems used by screening programs in countries such as Australia, Sweden, and the UK. Therefore, implementation demands human-AI collaboration. Here, we use a large, high-quality retrospective mammography dataset from Victoria, Australia to conduct detailed simulations of five potential AI-integrated screening pathways, and examine human-AI interaction effects to explore automation bias. Operating an AI reader as a second reader or as a high-confidence filter improves current screening outcomes by 1.9–2.5% in sensitivity and up to 0.6% in specificity, achieving a 4.6–10.9% reduction in assessments and a 48–80.7% reduction in human reads. Automation bias degrades performance in multi-reader settings but improves it for single readers. This study provides insight into feasible approaches for AI-integrated screening pathways and the prospective studies necessary prior to clinical adoption.


Conflict of interest statement

P.B. is an employee of annalise.ai. C.W., Y.C., D.J.M., M.S.E., H.M.L.F. and G.C. are inventors on a patent, 'WO2024044815—Improved classification methods for machine learning', which covers a model used in versions of the BRAIx AI reader. The remaining authors declare no competing interests.

Figures

Fig. 1. Screening episode flows for the current reader system and AI-integration scenarios.
A Standard of care scenario: Readers 1 and 2 each read the same episode and opt to recall or not recall; if they disagree, Reader 3 arbitrates. B AI standalone scenario: all decisions are made by the AI Reader without human intervention. C AI single-reader scenario: Reader 1 makes the final decision with AI Reader input. D AI reader-replacement scenario: as in (A), but with the AI Reader replacing Reader 2. E AI band-pass scenario: the AI Reader screens episodes before Readers 1 and 2. Episodes with high scores trigger the recall decision directly, and episodes with low scores trigger the no-recall decision directly; the remaining episodes continue to the usual reader system. F AI triage scenario: the AI Reader triages episodes before Readers 1 and 2. Episodes with high scores continue to the usual system, and episodes with low scores follow a single-reader path.
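
The decision logic in this caption can be read as a set of simple rules. The following Python sketch illustrates four of the pathways under stated assumptions; the function names, inputs, and thresholds are illustrative stand-ins, not the authors' simulation code.

    # Illustrative decision rules for the Fig. 1 pathways (hypothetical API).
    # Reader decisions are booleans (True = recall); ai_score is the AI
    # Reader's continuous suspicion score; all thresholds are assumptions.

    def standard_of_care(reader1, reader2, reader3):
        """Panel A: two independent readers; a third arbitrates disagreements."""
        if reader1 == reader2:
            return reader1
        return reader3  # arbitration

    def reader_replacement(reader1, ai_recall, reader3):
        """Panel D: the AI Reader takes the place of Reader 2."""
        if reader1 == ai_recall:
            return reader1
        return reader3  # human arbitration of the human-AI disagreement

    def band_pass(ai_score, low, high, reader_system):
        """Panel E: the AI Reader decides high-confidence episodes directly."""
        if ai_score >= high:
            return True          # direct recall
        if ai_score <= low:
            return False         # direct no-recall
        return reader_system()   # mid-band episodes go to the usual readers

    def triage(ai_score, threshold, single_reader, reader_system):
        """Panel F: low-scoring episodes follow a single-reader path."""
        if ai_score >= threshold:
            return reader_system()
        return single_reader()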
Fig. 2. Performance of the AI reader on the retrospective cohort.
A The AI reader ROC curve compared with the weighted mean individual reader and the reader consensus. The AI reader achieved an AUC of 0.932 (95% CI 0.923, 0.940; n = 149,105 screening episodes), above the weighted mean individual reader performance (95.6% specificity, 66.7% sensitivity) but below the reader consensus performance (96.1% specificity, 79.8% sensitivity; standard of care). The weighted mean individual reader (black circle; n = 125 readers) is the mean sensitivity and specificity of all individual readers (grey circles), weighted by their respective total numbers of reads. B, C The AI reader compared against 81 individual readers (min. 1000 reads). An optimal point from the AI reader ROC curve is shown for each comparison. We show separately the human readers for which both sensitivity and specificity of the AI reader point were greater than or equal to the reader's (B; 74 readers, 91.3% of readers; 253,328 reads, 88.3% of reads) and the readers for which the AI reader point fell below the human reader in either sensitivity or specificity (C; 7 readers, 8.6%; 33,525 reads, 11.7%). Source data are provided as a Source Data file.
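
Panels B and C partition the individual readers by whether some point on the AI ROC curve matches or beats them on both axes. A minimal numpy sketch of that dominance check, assuming arrays tracing the AI ROC curve and a single reader operating point (the names are assumptions):

    import numpy as np

    def ai_dominates_reader(ai_sens, ai_spec, reader_sens, reader_spec):
        """True if any AI operating point matches or exceeds the reader
        on both sensitivity and specificity (the Fig. 2B criterion)."""
        ai_sens = np.asarray(ai_sens)
        ai_spec = np.asarray(ai_spec)
        return bool(np.any((ai_sens >= reader_sens) & (ai_spec >= reader_spec)))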
Fig. 3. Comparison of AI-integrated scenarios.
A Human reader consensus performance compared with the AI standalone, AI single-reader, AI reader-replacement, AI band-pass and AI triage scenarios on the retrospective cohort (n = 149,105 screening episodes) without interaction effects. Representative points are shown for AI standalone (96.0% specificity, 75.0% sensitivity), AI single-reader (95.6% specificity, 67.3% sensitivity), AI reader-replacement (96.3% specificity, 82.3% sensitivity), AI band-pass (96.6% specificity, 81.7% sensitivity) and AI triage (95.7% specificity, 78.0% sensitivity). Other potential operating points are shown as a continuous line. Both AI reader-replacement and AI band-pass improved performance over the human reader consensus (96.1% specificity, 79.8% sensitivity). B AI-integrated scenarios with reader performance varied by an interaction effect applied when the human reader disagrees with the AI reader. From 0% to 50% of discordant decisions are reversed: only when the AI reader was correct (triangle, positive effect), uniformly regardless of AI correctness (circle, neutral effect), or only when it was incorrect (diamond, negative effect). For AI triage to match human reader consensus performance, a 15% positive interaction effect of the AI reader on human readers is required. Source data are provided as a Source Data file.
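
The interaction effect in panel B can be mimicked with a simple reversal rule: with probability p, a human decision that disagrees with the AI is flipped to the AI's decision, restricted to episodes where the AI was correct (positive effect), where it was incorrect (negative effect), or with no restriction (neutral effect). A sketch under those assumptions; the array names and signature are illustrative, not the authors' code:

    import numpy as np

    rng = np.random.default_rng(0)

    def apply_interaction(human, ai, truth, p, effect="neutral"):
        """Reverse a fraction p of human decisions discordant with the AI.

        human, ai, truth: boolean numpy arrays (True = recall / cancer present).
        effect: 'positive' flips only where the AI is correct,
                'negative' only where the AI is incorrect,
                'neutral' flips discordant decisions uniformly.
        """
        human = human.copy()
        discordant = human != ai
        if effect == "positive":
            eligible = discordant & (ai == truth)
        elif effect == "negative":
            eligible = discordant & (ai != truth)
        else:
            eligible = discordant
        flip = eligible & (rng.random(human.size) < p)
        human[flip] = ai[flip]  # reversed decisions follow the AI
        return human

    # e.g. the 15% positive effect noted in the caption:
    # adjusted = apply_interaction(human, ai, truth, p=0.15, effect="positive")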
Fig. 4. Screening episode exclusion criteria.
Flow diagram of study exclusion criteria for screening episodes from the standardised screening pathway at BreastScreen Victoria. Missing data could be clinical data without mammograms or mammograms without clinical data; clinical data could also be incomplete, missing assessment, reader, or screening records. Earlier screening attempt refers to a client returning for imaging as part of the same screening round; only the last attempt was used. Failed outcome determination and failed outcome reduction refer to being unable to confirm the final screening outcome for the episode. Missing reader records refers to missing reader data. Inconsistent recall status refers to conflicting data sources on whether an episode was recalled. Incomplete screening years refers to years for which we did not have the full year of data to sample from (2013–2015); these years were excluded from the testing and development datasets as they are not representative.
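
The exclusions above amount to sequential filters over an episode table. A sketch of how they might be applied with pandas; every column name here is a hypothetical stand-in for the study's actual data model:

    import pandas as pd

    def apply_exclusions(episodes: pd.DataFrame) -> pd.DataFrame:
        """Apply the Fig. 4 exclusion criteria in order (hypothetical columns)."""
        e = episodes[~episodes["missing_data"]]           # no mammograms or no clinical data
        e = (e.sort_values("attempt_date")                # keep only the last attempt
               .groupby(["client_id", "screening_round"]).tail(1))
        e = e[e["outcome_determined"]]                    # final screening outcome confirmed
        e = e[~e["missing_reader_records"]]               # reader data present
        e = e[~e["inconsistent_recall"]]                  # recall status sources agree
        e = e[~e["screening_year"].between(2013, 2015)]   # drop incomplete years
        return e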
