Evaluating progress in automatic chest X-ray radiology report generation

Affiliations

¹ Department of Computer Science, Stanford University, Stanford, CA 94305, USA.
² Department of Radiology, Brigham and Women's Hospital, Boston, MA 02115, USA.
³ Department of Radiology, Boston Children's Hospital, Harvard Medical School, Boston, MA 02115, USA.
⁴ Cardiothoracic Radiology Group, Hospital Israelita Albert Einstein, São Paulo, São Paulo 05652, Brazil.
⁵ Dalla Lana School of Public Health, University of Toronto, Toronto, ON M5T 3M7, Canada.
⁶ AIMI Center, Stanford University, Stanford, CA 94304, USA.
⁷ CARPL.ai, New Delhi, Delhi 110016, India.
⁸ Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA.

PMID: 37720336
PMCID: PMC10499844
DOI: 10.1016/j.patter.2023.100802

Feiyang Yu et al. Patterns (N Y). 2023 Aug 3;4(9):100802. doi: 10.1016/j.patter.2023.100802. eCollection 2023 Sep 8.

Abstract

Artificial intelligence (AI) models for automatic generation of narrative radiology reports from images have the potential to enhance efficiency and reduce the workload of radiologists. However, evaluating the correctness of these reports requires metrics that can capture clinically pertinent differences. In this study, we investigate the alignment between automated metrics and radiologists' scoring of errors in report generation. We address the limitations of existing metrics by proposing new metrics, RadGraph F1 and RadCliQ, which demonstrate stronger correlation with radiologists' evaluations. In addition, we analyze the failure modes of the metrics to understand their limitations and provide guidance for metric selection and interpretation. This study establishes RadGraph F1 and RadCliQ as meaningful metrics for guiding future research in radiology report generation.

Keywords: alignment with radiologists; automatic metrics; chest X-ray radiology report generation.

Conflict of interest statement

The authors declare no competing non-financial interests but the following competing financial interests: I.P. is a consultant for MD.ai and Diagnosticos da America (Dasa). C.P.L. serves on the board of directors and is a shareholder of Bunkerhill Health. He is an advisor and option holder for GalileoCDS, Sirona Medical, Adra, and Kheiron. He is an advisor to Sixth Street and an option holder in whiterabbit.ai. His research program has received grant or gift support from Carestream, Clairity, GE Healthcare, Google Cloud, IBM, IDEXX, Hospital Israelita Albert Einstein, Kheiron, Lambda, Lunit, Microsoft, Nightingale Open Science, Nines, Philips, Subtle Medical, VinBrain, Whiterabbit.ai, the Paustenbach Fund, the Lowenstein Foundation, and the Gordon and Betty Moore Foundation.

Figures

**Figure 1**
Method overview. (A) Experimental design for selecting radiology reports and comparing metrics and radiologists in evaluating reports. (B) Given a test report, selecting the report with the highest metric score from the training report corpus with respect to the test report and a particular metric. (C) Conducting radiologist evaluation on the high metric score report relative to the test report, where radiologists identify the number of clinically significant and insignificant errors in the high metric score report across six error categories. (D) Determining the alignment between metric scores and radiologist scores assigned to the same reports using the Kendall rank correlation coefficient.
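The retrieval step in panel (B) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: `unigram_f1` is a hypothetical toy metric standing in for BLEU, BERTScore, CheXbert vector similarity, or RadGraph F1, and the example reports are invented.

```python
from collections import Counter
from typing import Callable, Sequence


def unigram_f1(candidate: str, reference: str) -> float:
    """Toy stand-in metric (assumption): F1 over shared lowercase tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection size
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def metric_oracle(
    test_report: str,
    corpus: Sequence[str],
    metric: Callable[[str, str], float] = unigram_f1,
) -> str:
    """Return the corpus report that scores highest against the test report."""
    return max(corpus, key=lambda report: metric(report, test_report))


# Invented example reports, for illustration only.
training_corpus = [
    "no acute cardiopulmonary process",
    "small left pleural effusion",
    "heart size is normal",
]
test_report = "small right pleural effusion"
oracle = metric_oracle(test_report, training_corpus)
```

Swapping in a different `metric` callable yields a different metric-oracle report for the same test report. Here the retrieved report matches on most tokens yet names the wrong side, the kind of discrepancy the radiologist review in panel (C) is designed to catch.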

**Figure 2**
Example study of reports, and error types and categories. (A) Example study of a test report and four metric-oracle reports corresponding to BLEU, BERTScore, CheXbert vector similarity, and RadGraph F1 that radiologists evaluate to identify errors. (B) Two error types and six error categories that radiologists identify for each pair of test report and metric-oracle report.

**Figure 3**
Correlations between metric scores and radiologist scores. Scatterplots and correlations between metric scores and radiologist scores of four metric-oracle generations from 50 studies, where radiologist scores are represented by the total number of clinically significant and insignificant errors (top row) and number of clinically significant errors (bottom row) identified by the radiologists. The translucent bands around the regression line represent 95% confidence intervals.
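To make the alignment measure concrete, the following sketch computes a Kendall rank correlation between metric scores and radiologist error counts. It assumes the tau-b variant, which adjusts for ties (error counts are integers and tie often); the page does not state which variant the authors used, and the sample numbers are invented.

```python
from itertools import combinations
from math import sqrt
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall tau-b: pairwise rank agreement with a correction for ties."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from both factors
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + ties_x_only) * (untied + ties_y_only))
    return (concordant - discordant) / denom if denom else float("nan")


# Invented values: higher metric score paired with fewer radiologist errors.
metric_scores = [0.9, 0.7, 0.5, 0.3]
error_counts = [0, 1, 2, 4]
tau = kendall_tau_b(metric_scores, error_counts)
```

Because a higher metric score should correspond to fewer errors, a well-aligned metric produces a strongly negative tau against error counts; the sign convention matters when comparing metrics.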

**Figure 4**
Distribution of errors across error categories for metric-oracle reports. Distribution of errors across six error categories for metric-oracle reports corresponding to BERTScore, BLEU, CheXbert vector similarity, and RadGraph F1, in terms of the number of clinically significant errors (left) and the total number of clinically significant and insignificant errors (right). Statistical significance is determined using the Benjamini-Hochberg procedure with a false discovery rate (FDR) of 1% to correct for multiple-hypothesis testing.
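The Benjamini-Hochberg step-up procedure named in the caption can be sketched as follows. This is a generic textbook version with the 1% FDR from the caption as the default; the p-values are invented and the function name is an assumption, not taken from the authors' code.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], fdr: float = 0.01) -> List[bool]:
    """Return, per hypothesis, whether it is rejected at the given FDR."""
    m = len(p_values)
    # Indices of hypotheses ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is at most (k / m) * fdr.
    largest_passing_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            largest_passing_rank = rank
    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for idx in order[:largest_passing_rank]:
        rejected[idx] = True
    return rejected


# Invented p-values, one per hypothetical pairwise comparison.
p_vals = [0.001, 0.2, 0.004, 0.03, 0.0005, 0.6]
significant = benjamini_hochberg(p_vals, fdr=0.01)
```

With six tests the per-rank thresholds are k/6 of 1%, so the three smallest p-values pass while 0.03 does not, even though it would pass an uncorrected 5% threshold.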
