Front Radiol. 2024 Jul 5:4:1390774. doi: 10.3389/fradi.2024.1390774. eCollection 2024.

ChatGPT yields low accuracy in determining LI-RADS scores based on free-text and structured radiology reports in German language

Philipp Fervers et al.

Abstract

Background: To investigate the feasibility of the large language model (LLM) ChatGPT for classifying liver lesions according to the Liver Imaging Reporting and Data System (LI-RADS) based on MRI reports, and to compare classification performance on structured vs. unstructured reports.

Methods: LI-RADS-classifiable liver lesions were included from structured and unstructured MRI reports written in German, with reported size, location, and arterial phase contrast enhancement as minimum inclusion criteria. The findings sections of the reports were passed to ChatGPT (GPT-3.5), which was instructed to determine a LI-RADS score for each classifiable liver lesion. Ground truth was established by two radiologists in consensus. Agreement between the ground truth and ChatGPT was assessed with Cohen's kappa. Test-retest reliability was assessed by passing a subset of n = 50 lesions to ChatGPT five times and computing the intraclass correlation coefficient (ICC).
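For illustration, a minimal Python sketch of this per-report querying step is given below. It assumes programmatic access via the OpenAI API, whereas the study used the chat.openai.com web interface; the instruction text, the model handle "gpt-3.5-turbo", and the temperature setting are assumptions for the sketch, not the authors' exact protocol.

```python
# Minimal sketch of the per-report querying step (hypothetical prompt wording;
# the study used the web interface, not the API).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

INSTRUCTION = (
    "You are given the findings section of a liver MRI report. "
    "Assign a LI-RADS category to each classifiable liver lesion "
    "and list the results as 'lesion: category'."
)

def classify_report(findings: str) -> str:
    """Send one findings section in a fresh conversation, mirroring the
    study's design of restarting the prompt after each query so earlier
    reports cannot influence the answer via chat context."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # GPT-3.5, as used in the study
        temperature=0,          # assumption: reduce run-to-run variability
        messages=[
            {"role": "system", "content": INSTRUCTION},
            {"role": "user", "content": findings},
        ],
    )
    return response.choices[0].message.content
```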

Results: 205 MRIs from 150 patients were included. The accuracy of ChatGPT at determining LI-RADS categories was poor (53% on unstructured and 44% on structured reports). Compared with structured reports, free-text reports yielded higher agreement with the ground truth (κ = 0.51 vs. κ = 0.44), a lower mean absolute error in LI-RADS scores (0.5 ± 0.5 vs. 0.6 ± 0.7, p < 0.05), and higher test-retest reliability (ICC = 0.81 vs. 0.50), although the structured reports contained the minimum required imaging features significantly more often (Chi-square test, p < 0.05).
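The statistics reported above can be computed with standard libraries. The sketch below uses made-up toy labels and counts (not the study data) purely to show how each quantity is obtained; all calls are standard scikit-learn, SciPy, pandas, and pingouin APIs.

```python
# Illustrative computation of the reported agreement statistics on toy data.
import numpy as np
import pandas as pd
import pingouin as pg
from scipy.stats import chi2_contingency
from sklearn.metrics import cohen_kappa_score

# Ordinal LI-RADS categories coded as integers, e.g. LR-3 -> 3 (toy labels).
ground_truth = np.array([3, 4, 5, 5, 2, 4, 3, 5])
chatgpt      = np.array([3, 4, 4, 5, 3, 4, 2, 5])

accuracy = (ground_truth == chatgpt).mean()        # exact-match accuracy
kappa = cohen_kappa_score(ground_truth, chatgpt)   # chance-corrected agreement
mae = np.abs(ground_truth - chatgpt).mean()        # mean absolute score error

# Test-retest reliability: each lesion scored in five independent queries,
# ICC computed over the repetitions (toy scores).
long = pd.DataFrame({
    "lesion": np.repeat(np.arange(4), 5),
    "rep": np.tile(np.arange(5), 4),
    "score": [3, 3, 3, 4, 3,  5, 5, 5, 5, 5,
              2, 3, 2, 2, 2,  4, 4, 5, 4, 4],
})
icc = pg.intraclass_corr(data=long, targets="lesion",
                         raters="rep", ratings="score")

# Chi-square test: do structured reports contain the minimum required
# imaging features more often than free-text reports? (made-up counts)
table = np.array([[95, 5],    # structured: features present / absent
                  [75, 25]])  # free-text:  features present / absent
chi2, p, dof, _ = chi2_contingency(table)

print(f"accuracy={accuracy:.0%}, kappa={kappa:.2f}, MAE={mae:.2f}, chi2 p={p:.3f}")
```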

Conclusions: ChatGPT attained only low accuracy when asked to determine LI-RADS scores from liver imaging reports. Its higher accuracy and consistency on free-text reports might relate to ChatGPT's training process.

Clinical relevance statement: Our study indicates both the need to optimize LLMs for structured clinical data input and the potential of LLMs for creating machine-readable labels based on large free-text radiological databases.

Keywords: LI-RADS (liver imaging reporting and data system); MRI; diagnosis; diagnostic imaging; liver; neoplasms.


Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

Figure 1. Along with the query to create a structured LI-RADS imaging report, the findings section of the MRI report was copied into the ChatGPT prompt without further user interaction (https://chat.openai.com/chat). Besides the description of liver lesions, possible incidental findings and non-liver pathologies were included. In this exemplary case, both liver lesions were classified by ChatGPT in accordance with the ground truth of two experienced radiologists. To preclude interference from ChatGPT's context sensitivity, the chat was restarted after each query. Note that for readability, the report was translated from German to English prior to this query; in the study itself, MRI reports were processed by ChatGPT without prior translation.

Figure 2. Enrollment of liver lesions.

Figure 3. Basic statistics. The mean age of the analyzed patient population was 65.5 ± 10.9 years (A). 74% (n = 111) of the included high-risk patients were male (B). The median number of included lesions per radiology report was 2 [1–3]; most reports contained 1 (80/205 reports, 39%), 2 (73/205 reports, 36%), or 3 (29/205 reports, 14%) lesions (C).

Figure 4. LI-RADS classification performance of ChatGPT on unstructured and structured radiology reports. Performance on unstructured reports is shown in the top row (A–C) and on structured reports in the bottom row (D–F). (A/D) Distribution of LI-RADS scores; (B/E) errors between the radiologists' ground truth and ChatGPT; (C/F) percentage of correct and incorrect LI-RADS classifications by ChatGPT.
