Evaluating Large Language Models for Automated Reporting and Data Systems Categorization: Cross-Sectional Study

Qingxia Wu et al. JMIR Med Inform. 2024 Jul 17;12:e55799. doi: 10.2196/55799.
Abstract

Background: Large language models show promise for improving radiology workflows, but their performance on structured radiological tasks such as Reporting and Data Systems (RADS) categorization remains unexplored.

Objective: This study aims to evaluate 3 large language model chatbots (Claude-2, GPT-3.5, and GPT-4) on assigning RADS categories to radiology reports and to assess the impact of different prompting strategies.

Methods: This cross-sectional study compared 3 chatbots on 30 radiology reports (10 per RADS criterion) using a 3-level prompting strategy: zero-shot, few-shot, and guideline PDF-informed prompts. The cases, meticulously prepared by board-certified radiologists, were grounded in Liver Imaging Reporting & Data System (LI-RADS) version 2018, Lung CT (computed tomography) Screening Reporting & Data System (Lung-RADS) version 2022, and Ovarian-Adnexal Reporting & Data System (O-RADS) magnetic resonance imaging. Each report underwent 6 assessments. Two blinded reviewers assessed the chatbots' responses for patient-level RADS categorization and overall rating. Agreement across repetitions was assessed using Fleiss κ.
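As a minimal illustration of the agreement analysis (not the authors' code), Fleiss κ across the 6 repeated runs can be computed from per-run category assignments with statsmodels; the report labels below are hypothetical.

```python
# Minimal sketch, assuming each report's 6 chatbot responses have already been
# mapped to discrete RADS categories. Not the authors' pipeline.
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical data: one row per report, one column per run (6 runs);
# values are the RADS categories assigned by the chatbot in that run.
runs_per_report = [
    ["LR-4", "LR-4", "LR-4", "LR-5", "LR-4", "LR-4"],
    ["LR-3", "LR-3", "LR-3", "LR-3", "LR-3", "LR-3"],
    ["4A",   "4B",   "4A",   "4A",   "3",    "4A"],
]

# aggregate_raters converts the (reports x runs) label matrix into the
# (reports x categories) count table that fleiss_kappa expects.
table, categories = aggregate_raters(runs_per_report)
print(f"Fleiss kappa across runs: {fleiss_kappa(table):.2f}")
```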

Results: Claude-2 achieved the highest accuracy in overall ratings with few-shot prompts and guideline PDFs (prompt-2), attaining 57% (17/30) average accuracy over 6 runs and 50% (15/30) accuracy with k-pass voting. Without prompt engineering, all chatbots performed poorly. The introduction of a structured exemplar prompt (prompt-1) increased the accuracy of overall ratings for all chatbots. Providing prompt-2 further improved Claude-2's performance, an enhancement not replicated by GPT-4. The interrun agreement was substantial for Claude-2 (κ=0.66 for overall rating and κ=0.69 for RADS categorization), fair for GPT-4 (κ=0.39 for both), and fair for GPT-3.5 (κ=0.21 for overall rating and κ=0.39 for RADS categorization). All chatbots showed significantly higher accuracy with LI-RADS version 2018 than with Lung-RADS version 2022 and O-RADS (P<.05); with prompt-2, Claude-2 achieved the highest overall rating accuracy of 75% (45/60) for LI-RADS version 2018.
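The k-pass voting figure above aggregates the 6 runs per report into a single prediction. A minimal sketch, assuming k-pass voting reduces to taking the most frequent category across runs (the paper's exact tie-breaking rule is not given here):

```python
from collections import Counter

def k_pass_vote(run_outputs: list[str]) -> str:
    """Return the most frequent category across repeated runs (simple majority).
    Assumed reading of "k-pass voting"; ties would need a policy of their own."""
    return Counter(run_outputs).most_common(1)[0][0]

# Hypothetical example: 6 runs on one Lung-RADS report.
print(k_pass_vote(["4A", "4A", "4B", "4A", "3", "4A"]))  # -> "4A"
```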

Conclusions: When equipped with structured prompts and guideline PDFs, Claude-2 demonstrated potential in assigning RADS categories to radiology cases according to established criteria such as LI-RADS version 2018. However, the current generation of chatbots lags in accurately categorizing cases based on more recent RADS criteria.

Keywords: ChatGPT; LI-RADS; Lung-RADS; O-RADS; Radiology Reporting and Data Systems; accuracy; categorization; chatbot; chatbots; large language model; recommendation; recommendations.

Conflict of interest statement

Conflicts of Interest: QW and PD are senior engineers of Beijing United Imaging Research Institute of Intelligent Imaging and United Imaging Intelligence (Beijing) Co, Ltd. JX and DS are senior specialists of Shanghai United Imaging Intelligence Co, Ltd. The companies had no role in the design or conduct of the study or in the analysis and interpretation of the data. All other authors report no conflicts of interest relevant to this article.

Figures

Figure 1. Flowchart of the study design. CT: computed tomography; LI-RADS: Liver Imaging Reporting & Data System; Lung-RADS: Lung CT Screening Reporting & Data System; MRI: magnetic resonance imaging; O-RADS: Ovarian-Adnexal Reporting & Data System; RADS: Reporting and Data Systems.

Figure 2. Bar graphs show the comparison of chatbot performance across 6 runs regarding (A) overall rating and (B) patient-level Reporting and Data Systems categorization.

Figure 3. The performance of chatbots and prompts within different Reporting and Data Systems criteria. (A) Overall rating and (B) patient-level RADS categorization. LI-RADS: Liver Imaging Reporting and Data System; Lung-RADS: Lung CT (computed tomography) Screening Reporting and Data System; O-RADS: Ovarian-Adnexal Reporting and Data System.
Figure 4. The number of error types for different chatbots. E1: Factual extraction error denotes the chatbots’ inability to paraphrase the radiological findings accurately, consequently misinterpreting the information. E2: Hallucination, encompassing the fabrication of nonexistent Reporting and Data Systems (RADS) categories (E2a) and RADS criteria (E2b). E3: Reasoning error, which includes the incapacity to logically interpret the imaging description (E3a) and the RADS category (E3b) accurately. The subtype errors for reasoning about the imaging description include the inability to reason about lesion signal (E3ai), lesion size (E3aii), and enhancement (E3aiii) accurately. E4: Explanatory error, encompassing inaccurate elucidation of the RADS category meaning (E4a) and erroneous explanation of the recommended management and follow-up corresponding to the RADS category (E4b).
