Comparative Study

Radiology. 2024 Oct;313(1):e241139. doi: 10.1148/radiol.241139.

Comparing Commercial and Open-Source Large Language Models for Labeling Chest Radiograph Reports

Felix J Dorfner et al. Radiology. 2024 Oct.

Abstract

Background: Rapid advances in large language models (LLMs) have led to the development of numerous commercial and open-source models. While recent publications have explored the use of OpenAI's GPT-4 to extract information of interest from radiology reports, there has not been a real-world comparison of GPT-4 with leading open-source models.

Purpose: To compare leading open-source LLMs with GPT-4 on the task of extracting relevant findings from chest radiograph reports.

Materials and Methods: Two independent datasets of free-text radiology reports from chest radiograph examinations were used in this retrospective study, performed between February 2, 2024, and February 14, 2024. The first dataset consisted of reports from the ImaGenome dataset, which provides reference standard annotations from the MIMIC-CXR database acquired between 2011 and 2016. The second dataset consisted of randomly selected reports created at the Massachusetts General Hospital between July 2019 and July 2021. In both datasets, the commercial models GPT-3.5 Turbo and GPT-4 were compared with open-source models that included Mistral-7B and Mixtral-8×7B (Mistral AI), Llama 2-13B and Llama 2-70B (Meta), and Qwen1.5-72B (Alibaba Group), as well as CheXbert and CheXpert-labeler (Stanford ML Group), in their ability to accurately label the presence of multiple findings in radiograph text reports using zero-shot and few-shot prompting. The McNemar test was used to compare F1 scores between models.

Results: On the ImaGenome dataset (n = 450), the highest-scoring open-source model, Llama 2-70B, achieved micro F1 scores of 0.97 for both zero-shot and few-shot prompting, compared with GPT-4 F1 scores of 0.98 for both (P > .99 and P < .001, respectively, for superiority of GPT-4). On the institutional dataset (n = 500), the highest-scoring open-source approach, an ensemble model, achieved micro F1 scores of 0.96 and 0.97 for zero-shot and few-shot prompting, respectively, compared with GPT-4 F1 scores of 0.98 and 0.97 (P < .001 and P > .99, respectively, for superiority of GPT-4).

Conclusion: Although GPT-4 was superior to open-source models in zero-shot report labeling, few-shot prompting with a small number of example reports closely matched the performance of GPT-4. The benefit of few-shot prompting varied across datasets and models.

© RSNA, 2024. Supplemental material is available for this article.
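To make the labeling setup concrete, the sketch below builds zero-shot and few-shot prompts of the kind the study evaluates (a brief task description, a JSON response template, and, for few-shot, example reports with their labels). The prompt wording, helper names, and the label set (drawn from the findings named in Figure 2C) are illustrative assumptions, not the authors' verbatim prompts:

import json

# Illustrative subset of findings from Figure 2C; the study's actual label
# sets differ between the ImaGenome and institutional experiments.
FINDINGS = ["atelectasis", "pleural effusion", "pneumothorax", "edema"]

TASK = (
    'For each finding, answer "yes", "no", or "maybe" based only on the '
    "report text. Respond with JSON in exactly this format:\n"
    + json.dumps({f: "yes|no|maybe" for f in FINDINGS}, indent=2)
)

def zero_shot_prompt(report: str) -> str:
    # Zero-shot: task description + JSON template + the report to label.
    return f"{TASK}\n\nReport:\n{report}\n\nJSON answer:"

def few_shot_prompt(report: str, examples: list[tuple[str, dict]]) -> str:
    # Few-shot: same components, plus example reports with reference labels.
    shots = "\n\n".join(
        f"Report:\n{ex}\nJSON answer:\n{json.dumps(labels)}"
        for ex, labels in examples
    )
    return f"{TASK}\n\n{shots}\n\nReport:\n{report}\n\nJSON answer:"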


Conflict of interest statement

Disclosures of conflicts of interest: F.J.D. No relevant relationships. L.J. No relevant relationships. L.D. No relevant relationships. F.A.M. No relevant relationships. T.R.B. No relevant relationships. M.C.C. No relevant relationships. F.B. No relevant relationships. L.C.A. Patents planned, issued, or pending with Klinikum rechts der Isar/Institut. J.S. No relevant relationships. T.S. No relevant relationships. A.E.K. Consulting fees from Guidepoint. J.M. Employee of Microsoft. K.K.B. Grants from the European Union (EU) (101079894) and Wilhelm-Sander Foundation; lecture payments from Canon Medical Systems and GE HealthCare; advisor for the EU Horizon 2020 LifeChamps project (875329) and EU IHI Project IMAGIO (101112053). C.P.B. Institutional research support from National Institutes of Health, EU Horizon 2020 program, Rappaport Foundation, and Michael J Fox Foundation; meeting and/or travel support from the Society of Interventional Radiology.

Figures

Figure 1:
Flowchart shows chest radiograph reports and relevant findings from the reference standard ImaGenome dataset (n = 450) and the institutional dataset used for testing (n = 500). The institutional dataset consisted of randomly selected reports created at the Massachusetts General Hospital. Of the 500 reports from the ImaGenome dataset, 50 were randomly chosen as a subset from which reports were randomly selected to serve as examples in the few-shot prompts and then excluded from the final analysis. Of the 540 reports from the institutional dataset, 40 reports were randomly chosen as a subset from which reports were randomly selected to serve as examples in the few-shot prompts and subsequently excluded from the final analysis. This left 500 reports in the final institutional test set. An additional set of 111 reports was used as a validation set for prompt development; these reports were also not included in the final test set. Enl. = enlarged.
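The split protocol in the caption reduces to a few lines of code. The sketch below assumes the reports are held in a simple list; the function name and seed are illustrative, as the authors' sampling code is not shown in this excerpt:

import random

def split_for_fewshot(reports, pool_size, seed=42):
    """Set aside a pool to draw few-shot examples from; the rest is the test set."""
    rng = random.Random(seed)  # seed value is an illustrative assumption
    indices = list(range(len(reports)))
    rng.shuffle(indices)
    pool = [reports[i] for i in indices[:pool_size]]
    test = [reports[i] for i in indices[pool_size:]]
    return pool, test

# ImaGenome: 500 reports -> 50-report few-shot pool, 450-report test set.
# Institutional: 540 reports -> 40-report pool, 500-report test set
# (a separate 111-report validation set was used for prompt development).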
Figure 2:
(A) Schematic shows the workflow for labeling unstructured data with a local large language model (LLM) and a cloud-hosted LLM. The prompt shown here was used to label the four relevant findings in the ImaGenome dataset. (B) Graphic shows the components of zero-shot and few-shot prompting. The zero-shot prompt consists of a brief task description along with a template outlining the desired JavaScript Object Notation (JSON) format for the response. The few-shot prompt additionally provides example reports along with their corresponding output labels. (C) Radar chart shows micro F1 scores for the binary classification task, in which each LLM's (GPT-4 [OpenAI], Qwen1.5-72B [Alibaba Group], Mixtral-8×7B [Mistral AI], Llama 2-70B [Meta AI]) predictions of “yes” and “maybe” are collapsed into “yes,” on the institutional test set using few-shot prompting. The best F1 score for each finding on this task was as follows: atelectasis (0.98), fracture (0.93), enl. (enlarged) cardiomediastinum (0.96), support devices (0.92), pneumothorax (1.0), pneumonia (0.93), pleural effusion (0.99), pleural other (0.82), lung opacity (0.97), lung lesion (0.80), edema (0.96), consolidation (0.90), and cardiomegaly (0.91). The F1 score was calculated as the harmonic mean of precision (also known as positive predictive value) and recall (also known as sensitivity). The micro F1 score was computed by aggregating the true-positive, false-negative, and false-positive findings across all classes.
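The caption defines the metric in prose; a minimal sketch of the same computation, assuming per-report dictionaries of binary labels (illustrative code, not the authors' evaluation script):

def micro_f1(truths, preds):
    """truths/preds: lists of dicts mapping finding name -> 0/1 label."""
    tp = fp = fn = 0
    for truth, pred in zip(truths, preds):
        for finding, t in truth.items():
            p = pred.get(finding, 0)
            tp += int(t == 1 and p == 1)  # true positive
            fp += int(t == 0 and p == 1)  # false positive
            fn += int(t == 1 and p == 0)  # false negative
    # Aggregate counts across all classes, then take precision and recall.
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # positive predictive value
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # sensitivity
    # Micro F1 is the harmonic mean of the aggregated precision and recall.
    return (2 * precision * recall / (precision + recall)
            if (precision + recall) else 0.0)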

