Multi-Institutional Assessment and Crowdsourcing Evaluation of Deep Learning for Automated Classification of Breast Density

Ken Chang et al. J Am Coll Radiol. 2020 Dec;17(12):1653-1662. doi: 10.1016/j.jacr.2020.05.015. Epub 2020 Jun 24.

Abstract

Objective: We developed deep learning algorithms to automatically assess BI-RADS breast density.

Methods: Using a large multi-institutional patient cohort of 108,230 digital screening mammograms from the Digital Mammographic Imaging Screening Trial (DMIST), we investigated the effects of data, model, and training parameters on overall model performance and obtained crowdsourced evaluations from attendees of the ACR 2019 Annual Meeting.
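
The article itself contains no code; as a rough, hypothetical sketch of the kind of classifier such a study might train, the PyTorch snippet below sets up a four-class BI-RADS density model. The ResNet-18 backbone, single-channel input, image size, and optimizer settings are all assumptions, not the authors' actual configuration.

import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 4  # BI-RADS density a-d: fatty, scattered, heterogeneously dense, extremely dense

# Assumed backbone; adapt the first convolution to single-channel mammograms.
model = models.resnet18(weights=None)
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # assumed learning rate

def train_step(images, labels):
    """One optimization step on a mini-batch of preprocessed mammograms."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()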

Results: Our best-performing algorithm achieved good agreement with radiologists who were qualified interpreters of mammograms, with a four-class κ of 0.667. When training used images randomly sampled from the data set rather than an equal number of images from each density category, model predictions were biased away from low-prevalence categories such as extremely dense breasts. The net result was an increase in sensitivity and a decrease in specificity for predicting dense breasts with equal-class sampling compared with random sampling. We also found that model performance degraded when evaluated on digital mammography data formats that differed from the one used for training, emphasizing the importance of multi-institutional training sets. Lastly, we showed that crowdsourced annotations, including those from attendees who routinely read mammograms, had higher agreement with our algorithm than with the original interpreting radiologists.
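
The random-versus-equal sampling comparison can be made concrete with a small sketch, again an assumption about implementation rather than the authors' code: a plain shuffled loader reproduces the natural class prevalence, while a weighted sampler draws every density category with equal probability at each mini-batch. The dummy dataset, label distribution, and batch size are placeholders.

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Dummy stand-ins for preprocessed mammograms and BI-RADS density labels (0-3),
# with extremely dense breasts (class 3) deliberately low-prevalence.
labels = np.random.default_rng(0).choice(4, size=1000, p=[0.1, 0.4, 0.4, 0.1])
train_dataset = TensorDataset(torch.randn(1000, 1, 224, 224), torch.as_tensor(labels))

# Random sampling: mini-batches mirror the natural class prevalence.
random_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Equal class sampling: weight each image inversely to its class frequency so that
# each density category is drawn with equal probability at every mini-batch.
class_counts = np.bincount(labels, minlength=4)
weights = torch.as_tensor(1.0 / class_counts[labels], dtype=torch.double)
sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
balanced_loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)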

Conclusion: We demonstrated how data, model, and training parameters can influence model performance and how crowdsourcing can be used for evaluation. This study was performed in tandem with the development of the ACR AI-LAB, a platform for democratizing artificial intelligence.

Keywords: ACR AI-LAB; BI-RADS; DMIST; artificial intelligence; breast density; deep learning; generalizability; mammogram; neural networks.

Conflict of interest statement

J.K. is a consultant/advisory board member for Infotech, Soft. S.A. is an employee of Bayer HealthCare. The other authors declare no competing interests.

Figures

Figure 1.
(A) A summary of all the data, model, and training parameter experiments performed. (B) Performance on the testing set (measured by four-class κ agreement with radiologist interpretation) increased with the percentage of the training set used. The 95% confidence interval is plotted in light green. (C) Effect of model and training parameters on testing set four-class κ agreement with radiologist interpretation. Black lines denote 95% confidence intervals. P values are denoted by *p < .05, **p < .01, ****p < .001.
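
For context, the four-class κ and its 95% confidence interval (panels B and C) could be computed along these lines; the percentile bootstrap here is an assumption, since the legend does not state how the intervals were derived.

import numpy as np
from sklearn.metrics import cohen_kappa_score

def kappa_with_ci(y_radiologist, y_model, n_boot=2000, seed=0):
    """Four-class Cohen's kappa with a percentile-bootstrap 95% CI."""
    y_radiologist, y_model = np.asarray(y_radiologist), np.asarray(y_model)
    rng = np.random.default_rng(seed)
    kappa = cohen_kappa_score(y_radiologist, y_model)
    n = len(y_radiologist)
    boots = [cohen_kappa_score(y_radiologist[idx], y_model[idx])
             for idx in (rng.integers(0, n, n) for _ in range(n_boot))]
    low, high = np.percentile(boots, [2.5, 97.5])
    return kappa, (low, high)
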
Figure 2.
(A) A visual display of the range of classifications from models trained with different model and training parameters for 50 patients in the testing set. The radiologist interpretation is displayed in the first row. The average breast density rating across all models and the radiologist interpretation is displayed in the last row and was used to order the patients from least dense (left) to most dense (right). (B) The distribution of predicted breast density labels in the testing set differed between experiments with random class sampling (left) and equal class sampling (right) at each mini-batch. P values are denoted by ****p < .001.
Figure 3.
(A) Intensity distribution histogram (frequency versus intensity value) of 100 randomly selected images of each pixel format. (B) Visualization of the histogram of intensities of 3,000 preprocessed images from the testing set, demonstrating clustering of images by image format. (C) Performance of models trained on specific image formats as well as on all images, showing that for image format-specific models, testing set performance decreased on image formats other than the one the model was trained on. (D-E) Visualization of an intermediate layer of the trained neural network for 3,000 images in the testing set, color-coded by image format and by radiologist interpretation of breast density.
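
A hypothetical sketch of the intermediate-layer visualization in panels D-E: register a forward hook on the penultimate layer of the classifier, collect features over the test set, and project them to 2-D. The legend does not name the embedding method, so t-SNE is an assumption, as are the stand-in model and test images.

import torch
import torch.nn as nn
from sklearn.manifold import TSNE
from torch.utils.data import DataLoader, TensorDataset
from torchvision import models

# Stand-in classifier and test images (in practice: the trained model and the
# 3,000 preprocessed testing-set mammograms).
model = models.resnet18(weights=None)
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
model.fc = nn.Linear(model.fc.in_features, 4)
test_loader = DataLoader(TensorDataset(torch.randn(300, 1, 224, 224)), batch_size=32)

features = []
def hook(module, inputs, output):
    features.append(output.flatten(1).detach().cpu())  # pooled (N, 512) features

handle = model.avgpool.register_forward_hook(hook)  # penultimate layer (assumed)
model.eval()
with torch.no_grad():
    for (images,) in test_loader:
        model(images)
handle.remove()

# Project to 2-D; color the points by image format or density label to mimic D-E.
embedding = TSNE(n_components=2).fit_transform(torch.cat(features).numpy())
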
Figure 4.
Confusion matrices showing the agreement among the original interpreting radiologist, the algorithm, and the crowd. The agreement between the algorithm and the crowd (B) was greater than the agreement between the crowd and the original interpreting radiologist (A). The agreement between the algorithm and the original interpreting radiologist for the same patient studies (C) is shown for reference. (D) There was higher agreement, in terms of four-class κ, with the algorithm than with the original interpreting radiologist from the DMIST trial, both for crowdsourcing participants who read mammograms and for those who did not. P values are denoted by *p < .001.
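
The pairwise agreements in panels A-C could be tabulated as below; the label arrays are random stand-ins for the three raters' four-class density calls on the same studies.

import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

rng = np.random.default_rng(0)
# Random stand-ins for 4-class density labels on the same 500 studies.
radiologist = rng.integers(0, 4, 500)
algorithm = rng.integers(0, 4, 500)
crowd = rng.integers(0, 4, 500)

for name, (a, b) in {
    "crowd vs radiologist (A)": (crowd, radiologist),
    "crowd vs algorithm (B)": (crowd, algorithm),
    "algorithm vs radiologist (C)": (algorithm, radiologist),
}.items():
    print(name, "four-class kappa =", round(cohen_kappa_score(a, b), 3))
    print(confusion_matrix(a, b, labels=[0, 1, 2, 3]))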
