Evaluating Large Language Models for Automated Reporting and Data Systems Categorization: Cross-Sectional Study
- PMID: 39018102
- PMCID: PMC11292156
- DOI: 10.2196/55799
Abstract
Background: Large language models show promise for improving radiology workflows, but their performance on structured radiological tasks such as Reporting and Data Systems (RADS) categorization remains unexplored.
Objective: This study aims to evaluate 3 large language model chatbots (Claude-2, GPT-3.5, and GPT-4) on assigning RADS categories to radiology reports and to assess the impact of different prompting strategies.
Methods: This cross-sectional study compared 3 chatbots using 30 radiology reports (10 per RADS system) and a 3-level prompting strategy: zero-shot, few-shot, and guideline PDF-informed prompts. The cases were grounded in Liver Imaging Reporting & Data System (LI-RADS) version 2018, Lung CT (computed tomography) Screening Reporting & Data System (Lung-RADS) version 2022, and Ovarian-Adnexal Reporting & Data System (O-RADS) magnetic resonance imaging, and were meticulously prepared by board-certified radiologists. Each report underwent 6 assessments. Two blinded reviewers assessed the chatbots' responses on patient-level RADS categorization and overall rating. Agreement across repetitions was assessed using Fleiss κ.
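As a rough illustration of how these 3 prompt levels might be composed, the Python sketch below builds a zero-shot, few-shot (prompt-1), or guideline PDF-informed (prompt-2) input for a single report. The instruction wording, exemplar, and pdf_text placeholder are assumptions for illustration, not the study's actual prompts.

```python
# A minimal sketch, assuming the 3-level prompting strategy is realized by
# progressively appending an exemplar and guideline text to a base
# instruction. All strings here are illustrative placeholders.

def build_prompt(report: str, level: int, exemplar: str = "", pdf_text: str = "") -> str:
    """Compose the chatbot input for one radiology report.

    level 0: zero-shot, the report alone with a bare instruction
    level 1 (prompt-1): adds a structured few-shot exemplar
    level 2 (prompt-2): additionally attaches the RADS guideline PDF text
    """
    parts = ["Assign the appropriate RADS category to the following report."]
    if level >= 1:
        parts.append(f"Example of the expected structured answer:\n{exemplar}")
    if level >= 2:
        parts.append(f"Official guideline for reference:\n{pdf_text}")
    parts.append(f"Report:\n{report}")
    return "\n\n".join(parts)
```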
Results: Claude-2 achieved the highest accuracy in overall ratings with few-shot prompts and guideline PDFs (prompt-2), attaining 57% (17/30) average accuracy over 6 runs and 50% (15/30) accuracy with k-pass voting. Without prompt engineering, all chatbots performed poorly. The introduction of a structured exemplar prompt (prompt-1) increased the accuracy of overall ratings for all chatbots. Providing prompt-2 further improved Claude-2's performance, an enhancement not replicated by GPT-4. The interrun agreement was substantial for Claude-2 (κ=0.66 for overall rating and κ=0.69 for RADS categorization), fair for GPT-4 (κ=0.39 for both), and fair for GPT-3.5 (κ=0.21 for overall rating and κ=0.39 for RADS categorization). All chatbots showed significantly higher accuracy with LI-RADS version 2018 than with Lung-RADS version 2022 and O-RADS (P<.05); with prompt-2, Claude-2 achieved the highest overall rating accuracy of 75% (45/60) in LI-RADS version 2018.
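The sketch below shows, under the assumption that one chatbot's outputs are stored as a 30 × 6 reports-by-runs matrix of integer-coded categories, how the three reported quantities (average per-run accuracy, k-pass majority-voting accuracy, and interrun Fleiss κ) can be computed with statsmodels. The data are random placeholders, not the study's results.

```python
# Minimal sketch of the accuracy and agreement metrics described above.
# k-pass voting is assumed to take the modal category across the k runs;
# Fleiss kappa treats the 6 repetitions as 6 "raters" per report.
from collections import Counter

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)
n_reports, n_runs, n_cats = 30, 6, 5
preds = rng.integers(0, n_cats, size=(n_reports, n_runs))  # chatbot outputs
truth = rng.integers(0, n_cats, size=n_reports)            # radiologist reference

# Average accuracy over the 6 independent runs
per_run_acc = (preds == truth[:, None]).mean()

# k-pass voting: a report counts as correct if the modal category
# across the k runs matches the reference
votes = np.array([Counter(row).most_common(1)[0][0] for row in preds])
vote_acc = (votes == truth).mean()

# aggregate_raters converts (subjects x raters) labels into the
# (subjects x categories) count table that fleiss_kappa expects
counts, _ = aggregate_raters(preds)
kappa = fleiss_kappa(counts)

print(f"mean per-run accuracy:  {per_run_acc:.2f}")
print(f"k-pass voting accuracy: {vote_acc:.2f}")
print(f"interrun Fleiss kappa:  {kappa:.2f}")
```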
Conclusions: When equipped with structured prompts and guideline PDFs, Claude-2 demonstrated potential in assigning RADS categories to radiology cases according to established criteria such as LI-RADS version 2018. However, the current generation of chatbots lags in accurately categorizing cases based on more recent RADS criteria.
Keywords: ChatGPT; LI-RADS; Lung-RADS; O-RADS; Radiology Reporting and Data Systems; accuracy; categorization; chatbot; chatbots; large language model; recommendation; recommendations.
©Qingxia Wu, Qingxia Wu, Huali Li, Yan Wang, Yan Bai, Yaping Wu, Xuan Yu, Xiaodong Li, Pei Dong, Jon Xue, Dinggang Shen, Meiyun Wang. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 17.07.2024.
Conflict of interest statement
Conflicts of Interest: QW and PD are senior engineers of Beijing United Imaging Research Institute of Intelligent Imaging and United Imaging Intelligence (Beijing) Co, Ltd. JX and DS are senior specialists of Shanghai United Imaging Intelligence Co, Ltd. The companies had no role in the design or conduct of the study, or in the analysis and interpretation of the data. All other authors report no conflicts of interest relevant to this article.