Multi-Institutional Assessment and Crowdsourcing Evaluation of Deep Learning for Automated Classification of Breast Density

Ken Chang et al. J Am Coll Radiol. 2020 Dec;17(12):1653-1662. doi: 10.1016/j.jacr.2020.05.015. Epub 2020 Jun 24.

Abstract

Objective: We developed deep learning algorithms to automatically assess BI-RADS breast density.

Methods: Using a large multi-institutional patient cohort of 108,230 digital screening mammograms from the Digital Mammographic Imaging Screening Trial (DMIST), we investigated the effects of data, model, and training parameters on overall model performance and obtained crowdsourced evaluations from attendees of the ACR 2019 Annual Meeting.
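
The article itself contains no code; as a rough, hypothetical sketch of the kind of classifier such a study might train, the PyTorch snippet below sets up a four-class BI-RADS density model. The ResNet-18 backbone, single-channel input, image size, and optimizer settings are all assumptions, not the authors' actual configuration.

import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 4  # BI-RADS density a-d: fatty, scattered, heterogeneously dense, extremely dense

# Assumed backbone; adapt the first convolution to single-channel mammograms.
model = models.resnet18(weights=None)
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # assumed learning rate

def train_step(images, labels):
    """One optimization step on a mini-batch of preprocessed mammograms."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()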

Results: Our best-performing algorithm achieved good agreement with radiologists who were qualified interpreters of mammograms, with a four-class κ of 0.667. When training used images randomly sampled from the data set rather than an equal number of images from each density category, model predictions were biased away from low-prevalence categories such as extremely dense breasts. The net result was an increase in sensitivity and a decrease in specificity for predicting dense breasts with equal-class sampling compared with random sampling. We also found that model performance degraded when evaluated on digital mammography data formats that differed from the one used for training, emphasizing the importance of multi-institutional training sets. Lastly, we showed that crowdsourced annotations, including those from attendees who routinely read mammograms, had higher agreement with our algorithm than with the original interpreting radiologists.
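
The random-versus-equal sampling comparison can be made concrete with a small sketch, again an assumption about implementation rather than the authors' code: a plain shuffled loader reproduces the natural class prevalence, while a weighted sampler draws every density category with equal probability at each mini-batch. The dummy dataset, label distribution, and batch size are placeholders.

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Dummy stand-ins for preprocessed mammograms and BI-RADS density labels (0-3),
# with extremely dense breasts (class 3) deliberately low-prevalence.
labels = np.random.default_rng(0).choice(4, size=1000, p=[0.1, 0.4, 0.4, 0.1])
train_dataset = TensorDataset(torch.randn(1000, 1, 224, 224), torch.as_tensor(labels))

# Random sampling: mini-batches mirror the natural class prevalence.
random_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Equal class sampling: weight each image inversely to its class frequency so that
# each density category is drawn with equal probability at every mini-batch.
class_counts = np.bincount(labels, minlength=4)
weights = torch.as_tensor(1.0 / class_counts[labels], dtype=torch.double)
sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
balanced_loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)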

Conclusion: We demonstrated how data, model, and training parameters can influence model performance and how crowdsourcing can be used for evaluation. This study was performed in tandem with the development of the ACR AI-LAB, a platform for democratizing artificial intelligence.

Keywords: ACR AI-LAB; BI-RADS; DMIST; artificial intelligence; breast density; deep learning; generalizability; mammogram; neural networks.

Conflict of interest statement

J.K. is a consultant/advisory board member for Infotech, Soft. S.A. is an employee of Bayer HealthCare. The other authors declare no competing interests.

Figures

Figure 1.
(A) A summary of all the data, model, and training parameter experiments performed. (B) Performance on the testing set (measured by four-class κ agreement with radiologist interpretation) increased with the percentage of the training set used. The 95% confidence interval is plotted in light green. (C) Effect of model and training parameters on testing set four-class κ agreement with radiologist interpretation. Black lines denote 95% confidence intervals. P values are denoted by *p < .05, **p < .01, ****p < .001.
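
For context, the four-class κ and its 95% confidence interval (panels B and C) could be computed along these lines; the percentile bootstrap here is an assumption, since the legend does not state how the intervals were derived.

import numpy as np
from sklearn.metrics import cohen_kappa_score

def kappa_with_ci(y_radiologist, y_model, n_boot=2000, seed=0):
    """Four-class Cohen's kappa with a percentile-bootstrap 95% CI."""
    y_radiologist, y_model = np.asarray(y_radiologist), np.asarray(y_model)
    rng = np.random.default_rng(seed)
    kappa = cohen_kappa_score(y_radiologist, y_model)
    n = len(y_radiologist)
    boots = [cohen_kappa_score(y_radiologist[idx], y_model[idx])
             for idx in (rng.integers(0, n, n) for _ in range(n_boot))]
    low, high = np.percentile(boots, [2.5, 97.5])
    return kappa, (low, high)
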
Figure 2.
(A) A visual display of the range of classifications from models trained with different model and training parameters for 50 patients in the testing set. The radiologist interpretation is displayed in the first row. The average breast density rating across all models and the radiologist interpretation is displayed in the last row and was used to order the patients from least dense (left) to most dense (right). (B) The distribution of predicted breast density labels in the testing set differed between experiments with random class sampling (left) and equal class sampling (right) at each mini-batch. P values are denoted by ****p < .001.
Figure 3.
(A) Intensity distribution histogram (frequency versus intensity value) of 100 randomly selected images of each pixel format. (B) Visualization of the histogram of intensities of 3,000 preprocessed images from the testing set, demonstrating clustering of images by image format. (C) Performance of models trained on specific image formats as well as on all images, showing that for image format-specific models, testing set performance decreased on image formats other than the one the model was trained on. (D-E) Visualization of an intermediate layer of the trained neural network for 3,000 images in the testing set, color-coded by image format and by radiologist interpretation of breast density.
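
A hypothetical sketch of the intermediate-layer visualization in panels D-E: register a forward hook on the penultimate layer of the classifier, collect features over the test set, and project them to 2-D. The legend does not name the embedding method, so t-SNE is an assumption, as are the stand-in model and test images.

import torch
import torch.nn as nn
from sklearn.manifold import TSNE
from torch.utils.data import DataLoader, TensorDataset
from torchvision import models

# Stand-in classifier and test images (in practice: the trained model and the
# 3,000 preprocessed testing-set mammograms).
model = models.resnet18(weights=None)
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
model.fc = nn.Linear(model.fc.in_features, 4)
test_loader = DataLoader(TensorDataset(torch.randn(300, 1, 224, 224)), batch_size=32)

features = []
def hook(module, inputs, output):
    features.append(output.flatten(1).detach().cpu())  # pooled (N, 512) features

handle = model.avgpool.register_forward_hook(hook)  # penultimate layer (assumed)
model.eval()
with torch.no_grad():
    for (images,) in test_loader:
        model(images)
handle.remove()

# Project to 2-D; color the points by image format or density label to mimic D-E.
embedding = TSNE(n_components=2).fit_transform(torch.cat(features).numpy())
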
Figure 4.
Confusion matrices showing the agreement among the original interpreting radiologist, the algorithm, and the crowd. The agreement between the algorithm and the crowd (B) was greater than the agreement between the crowd and the original interpreting radiologist (A). The agreement between the algorithm and the original interpreting radiologist for the same patient studies (C) is shown for reference. (D) There was higher agreement, in terms of four-class κ, with the algorithm than with the original interpreting radiologist from the DMIST trial, both for crowdsourcing participants who read mammograms and for those who did not. P values are denoted by *p < .001.
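
The pairwise agreements in panels A-C could be tabulated as below; the label arrays are random stand-ins for the three raters' four-class density calls on the same studies.

import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

rng = np.random.default_rng(0)
# Random stand-ins for 4-class density labels on the same 500 studies.
radiologist = rng.integers(0, 4, 500)
algorithm = rng.integers(0, 4, 500)
crowd = rng.integers(0, 4, 500)

for name, (a, b) in {
    "crowd vs radiologist (A)": (crowd, radiologist),
    "crowd vs algorithm (B)": (crowd, algorithm),
    "algorithm vs radiologist (C)": (algorithm, radiologist),
}.items():
    print(name, "four-class kappa =", round(cohen_kappa_score(a, b), 3))
    print(confusion_matrix(a, b, labels=[0, 1, 2, 3]))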
