Jpn J Radiol. 2024 Feb;42(2):201-207.
doi: 10.1007/s11604-023-01491-2. Epub 2023 Oct 4.

Performance evaluation of ChatGPT, GPT-4, and Bard on the official board examination of the Japan Radiology Society

Yoshitaka Toyama et al. Jpn J Radiol. 2024 Feb.

Abstract

Purpose: Herein, we assessed the accuracy of large language models (LLMs) in generating responses to questions in clinical radiology practice. We compared the performance of ChatGPT, GPT-4, and Google Bard using questions from the Japan Radiology Board Examination (JRBE).

Materials and methods: In total, 103 questions from the JRBE 2022 were used with permission from the Japan Radiological Society. These questions were categorized by pattern, required level of thinking, and topic. McNemar's test was used to compare the proportion of correct responses between the LLMs. Fisher's exact test was used to assess the performance of GPT-4 for each topic category.
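The paper does not provide its analysis code; the following is a minimal sketch of how a paired McNemar comparison between two models and a Fisher's exact test across topic categories could be run in Python with statsmodels and SciPy. All response vectors and counts below are hypothetical placeholders, not the study data.

    # Illustrative sketch only: hypothetical data, not the authors' analysis code.
    import numpy as np
    from scipy.stats import fisher_exact
    from statsmodels.stats.contingency_tables import mcnemar

    # Hypothetical paired results for the same 103 questions (1 = correct, 0 = incorrect).
    rng = np.random.default_rng(0)
    gpt4_correct = rng.integers(0, 2, size=103)
    chatgpt_correct = rng.integers(0, 2, size=103)

    # McNemar's test works on the 2x2 table of concordant/discordant question pairs.
    table = np.zeros((2, 2), dtype=int)
    for g, c in zip(gpt4_correct, chatgpt_correct):
        table[g, c] += 1  # rows: GPT-4 correct?, columns: ChatGPT correct?

    result = mcnemar(table, exact=True)  # exact binomial form, suitable for small discordant counts
    print(f"McNemar p-value: {result.pvalue:.4f}")

    # Fisher's exact test comparing GPT-4 accuracy between two topic categories
    # (placeholder [correct, incorrect] counts, not the paper's figures).
    nuclear_medicine = [14, 1]
    diagnostic_radiology = [29, 23]
    odds_ratio, p_value = fisher_exact([nuclear_medicine, diagnostic_radiology])
    print(f"Fisher's exact p-value: {p_value:.4f}")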

Results: ChatGPT, GPT-4, and Google Bard correctly answered 40.8% (42 of 103), 65.0% (67 of 103), and 38.8% (40 of 103) of the questions, respectively. GPT-4 significantly outperformed ChatGPT by 24.2% (p < 0.001) and Google Bard by 26.2% (p < 0.001). In the categorical analysis by level of thinking, GPT-4 correctly answered 79.7% of the lower-order questions, which was significantly higher than ChatGPT or Google Bard (p < 0.001). The categorical analysis by question pattern revealed GPT-4's superiority over ChatGPT (67.4% vs. 46.5%, p = 0.004) and Google Bard (39.5%, p < 0.001) in the single-answer questions. The categorical analysis by topic revealed that GPT-4 outperformed ChatGPT (40%, p = 0.013) and Google Bard (26.7%, p = 0.004). No significant differences were observed between the LLMs in the categories not mentioned above. The performance of GPT-4 was significantly better in nuclear medicine (93.3%) than in diagnostic radiology (55.8%; p < 0.001). GPT-4 also performed better on lower-order questions than on higher-order questions (79.7% vs. 45.5%, p < 0.001).

Conclusion: ChatGPT Plus, based on GPT-4, scored 65% when answering Japanese questions from the JRBE, outperforming ChatGPT and Google Bard. This highlights the potential of using LLMs to address advanced clinical questions in the field of radiology in Japan.

Keywords: Bard; ChatGPT; GPT-4; Japan Radiology Society.


Conflict of interest statement

The authors declare no conflicts of interest.

Figures

Fig. 1
Response samples of correct answers generated by each LLM on a JRBE question. A Question manually typed into a prompt and the response generated by GPT-4; A1: English version of (A). B ChatGPT response. C Google Bard response. The models' responses varied in structure; most include an overview of the topic relevant to the question (A2), a representation of the alternatives (A3), and the answer with its justification (A4). Sometimes the response simply states the answer (B); sometimes the answer is generated with an explanation (C)
Fig. 2
Response sample presenting a "hallucination" generated by GPT-4 on a JRBE 2022 question. A Question manually typed into a prompt and the response; A1: English version of the question. The response is structured in the same way as that in Fig. 1: B overview of the topic related to the question; C representation of the provided alternatives; D answer and its justification. In this response, a wrong answer and its justification are presented in a confident, convincing tone, which is called "hallucination"

