Evaluating Large Language Models for Automated Reporting and Data Systems Categorization: Cross-Sectional Study

Qingxia Wu et al. JMIR Med Inform. 2024 Jul 17;12:e55799. doi: 10.2196/55799.
Abstract

Background: Large language models show promise for improving radiology workflows, but their performance on structured radiological tasks such as Reporting and Data Systems (RADS) categorization remains unexplored.

Objective: This study aims to evaluate 3 large language model chatbots (Claude-2, GPT-3.5, and GPT-4) on assigning RADS categories to radiology reports and to assess the impact of different prompting strategies.

Methods: This cross-sectional study compared 3 chatbots on 30 radiology reports (10 per RADS criterion) using a 3-level prompting strategy: zero-shot, few-shot, and guideline PDF-informed prompts. The cases, meticulously prepared by board-certified radiologists, were grounded in Liver Imaging Reporting & Data System (LI-RADS) version 2018, Lung CT (computed tomography) Screening Reporting & Data System (Lung-RADS) version 2022, and Ovarian-Adnexal Reporting & Data System (O-RADS) magnetic resonance imaging. Each report underwent 6 assessments. Two blinded reviewers assessed the chatbots' responses for patient-level RADS categorization and overall rating. Agreement across repetitions was assessed using Fleiss κ.
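As a minimal illustration of the agreement analysis (not the authors' code), Fleiss κ across the 6 repeated runs can be computed from per-run category assignments with statsmodels; the report labels below are hypothetical.

```python
# Minimal sketch, assuming each report's 6 chatbot responses have already been
# mapped to discrete RADS categories. Not the authors' pipeline.
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical data: one row per report, one column per run (6 runs);
# values are the RADS categories assigned by the chatbot in that run.
runs_per_report = [
    ["LR-4", "LR-4", "LR-4", "LR-5", "LR-4", "LR-4"],
    ["LR-3", "LR-3", "LR-3", "LR-3", "LR-3", "LR-3"],
    ["4A",   "4B",   "4A",   "4A",   "3",    "4A"],
]

# aggregate_raters converts the (reports x runs) label matrix into the
# (reports x categories) count table that fleiss_kappa expects.
table, categories = aggregate_raters(runs_per_report)
print(f"Fleiss kappa across runs: {fleiss_kappa(table):.2f}")
```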

Results: Claude-2 achieved the highest accuracy in overall ratings with few-shot prompts and guideline PDFs (prompt-2), attaining 57% (17/30) average accuracy over 6 runs and 50% (15/30) accuracy with k-pass voting. Without prompt engineering, all chatbots performed poorly. The introduction of a structured exemplar prompt (prompt-1) increased the accuracy of overall ratings for all chatbots. Providing prompt-2 further improved Claude-2's performance, an enhancement not replicated by GPT-4. The interrun agreement was substantial for Claude-2 (κ=0.66 for overall rating and κ=0.69 for RADS categorization), fair for GPT-4 (κ=0.39 for both), and fair for GPT-3.5 (κ=0.21 for overall rating and κ=0.39 for RADS categorization). All chatbots showed significantly higher accuracy with LI-RADS version 2018 than with Lung-RADS version 2022 and O-RADS (P<.05); with prompt-2, Claude-2 achieved the highest overall rating accuracy of 75% (45/60) for LI-RADS version 2018.
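The k-pass voting figure above aggregates the 6 runs per report into a single prediction. A minimal sketch, assuming k-pass voting reduces to taking the most frequent category across runs (the paper's exact tie-breaking rule is not given here):

```python
from collections import Counter

def k_pass_vote(run_outputs: list[str]) -> str:
    """Return the most frequent category across repeated runs (simple majority).
    Assumed reading of "k-pass voting"; ties would need a policy of their own."""
    return Counter(run_outputs).most_common(1)[0][0]

# Hypothetical example: 6 runs on one Lung-RADS report.
print(k_pass_vote(["4A", "4A", "4B", "4A", "3", "4A"]))  # -> "4A"
```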

Conclusions: When equipped with structured prompts and guideline PDFs, Claude-2 demonstrated potential in assigning RADS categories to radiology cases according to established criteria such as LI-RADS version 2018. However, the current generation of chatbots lags in accurately categorizing cases based on more recent RADS criteria.

Keywords: ChatGPT; LI-RADS; Lung-RADS; O-RADS; Radiology Reporting and Data Systems; accuracy; categorization; chatbot; chatbots; large language model; recommendation; recommendations.

Conflict of interest statement

Conflicts of Interest: QW and PD are senior engineers of Beijing United Imaging Research Institute of Intelligent Imaging and United Imaging Intelligence (Beijing) Co, Ltd. JX and DS are senior specialists of Shanghai United Imaging Intelligence Co, Ltd. The companies had no role in the design or conduct of the study or in the analysis and interpretation of the data. All other authors report no conflicts of interest relevant to this article.

Figures

Figure 1. Flowchart of the study design. CT: computed tomography; LI-RADS: Liver Imaging Reporting & Data System; Lung-RADS: Lung CT Screening Reporting & Data System; MRI: magnetic resonance imaging; O-RADS: Ovarian-Adnexal Reporting & Data System; RADS: Reporting and Data Systems.

Figure 2. Bar graphs show the comparison of chatbot performance across 6 runs regarding (A) overall rating and (B) patient-level Reporting and Data Systems categorization.

Figure 3. The performance of chatbots and prompts within different Reporting and Data Systems criteria. (A) Overall rating and (B) patient-level RADS categorization. LI-RADS: Liver Imaging Reporting and Data System; Lung-RADS: Lung CT (computed tomography) Screening Reporting and Data System; O-RADS: Ovarian-Adnexal Reporting and Data System.
Figure 4. The number of error types for different chatbots. E1: Factual extraction error denotes the chatbots’ inability to paraphrase the radiological findings accurately, consequently misinterpreting the information. E2: Hallucination, encompassing the fabrication of nonexistent Reporting and Data Systems (RADS) categories (E2a) and RADS criteria (E2b). E3: Reasoning error, which includes the incapacity to logically interpret the imaging description (E3a) and the RADS category (E3b) accurately. The subtype errors for reasoning about the imaging description include the inability to reason about lesion signal (E3ai), lesion size (E3aii), and enhancement (E3aiii) accurately. E4: Explanatory error, encompassing inaccurate elucidation of the RADS category meaning (E4a) and erroneous explanation of the recommended management and follow-up corresponding to the RADS category (E4b).
