Jpn J Radiol. 2024 Feb;42(2):201-207.
doi: 10.1007/s11604-023-01491-2. Epub 2023 Oct 4.

Performance evaluation of ChatGPT, GPT-4, and Bard on the official board examination of the Japan Radiology Society

Yoshitaka Toyama et al. Jpn J Radiol. 2024 Feb.

Abstract

Purpose: Herein, we assessed the accuracy of large language models (LLMs) in generating responses to questions in clinical radiology practice. We compared the performance of ChatGPT, GPT-4, and Google Bard using questions from the Japan Radiology Board Examination (JRBE).

Materials and methods: In total, 103 questions from the JRBE 2022 were used with permission from the Japan Radiological Society. These questions were categorized by pattern, required level of thinking, and topic. McNemar's test was used to compare the proportion of correct responses between the LLMs. Fisher's exact test was used to assess the performance of GPT-4 for each topic category.
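The paper does not provide its analysis code; the following is a minimal sketch of how a paired McNemar comparison between two models and a Fisher's exact test across topic categories could be run in Python with statsmodels and SciPy. All response vectors and counts below are hypothetical placeholders, not the study data.

    # Illustrative sketch only: hypothetical data, not the authors' analysis code.
    import numpy as np
    from scipy.stats import fisher_exact
    from statsmodels.stats.contingency_tables import mcnemar

    # Hypothetical paired results for the same 103 questions (1 = correct, 0 = incorrect).
    rng = np.random.default_rng(0)
    gpt4_correct = rng.integers(0, 2, size=103)
    chatgpt_correct = rng.integers(0, 2, size=103)

    # McNemar's test works on the 2x2 table of concordant/discordant question pairs.
    table = np.zeros((2, 2), dtype=int)
    for g, c in zip(gpt4_correct, chatgpt_correct):
        table[g, c] += 1  # rows: GPT-4 correct?, columns: ChatGPT correct?

    result = mcnemar(table, exact=True)  # exact binomial form, suitable for small discordant counts
    print(f"McNemar p-value: {result.pvalue:.4f}")

    # Fisher's exact test comparing GPT-4 accuracy between two topic categories
    # (placeholder [correct, incorrect] counts, not the paper's figures).
    nuclear_medicine = [14, 1]
    diagnostic_radiology = [29, 23]
    odds_ratio, p_value = fisher_exact([nuclear_medicine, diagnostic_radiology])
    print(f"Fisher's exact p-value: {p_value:.4f}")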

Results: ChatGPT, GPT-4, and Google Bard correctly answered 40.8% (42 of 103), 65.0% (67 of 103), and 38.8% (40 of 103) of the questions, respectively. GPT-4 significantly outperformed ChatGPT by 24.2% (p < 0.001) and Google Bard by 26.2% (p < 0.001). In the categorical analysis by level of thinking, GPT-4 correctly answered 79.7% of the lower-order questions, which was significantly higher than ChatGPT or Google Bard (p < 0.001). The categorical analysis by question pattern revealed GPT-4's superiority over ChatGPT (67.4% vs. 46.5%, p = 0.004) and Google Bard (39.5%, p < 0.001) in the single-answer questions. The categorical analysis by topic revealed that GPT-4 outperformed ChatGPT (40%, p = 0.013) and Google Bard (26.7%, p = 0.004). No significant differences were observed between the LLMs in the categories not mentioned above. The performance of GPT-4 was significantly better in nuclear medicine (93.3%) than in diagnostic radiology (55.8%; p < 0.001). GPT-4 also performed better on lower-order questions than on higher-order questions (79.7% vs. 45.5%, p < 0.001).

Conclusion: ChatGPT Plus, based on GPT-4, scored 65% when answering Japanese questions from the JRBE, outperforming ChatGPT and Google Bard. This highlights the potential of using LLMs to address advanced clinical questions in the field of radiology in Japan.

Keywords: Bard; ChatGPT; GPT-4; Japan Radiology Society.


Conflict of interest statement

The authors declare no conflicts of interest.

Figures

Fig. 1
Response samples of correct answers generated by each LLM on a JRBE question. A Question manually typed into a prompt and the response generated by GPT-4; A1: English version of (A). B ChatGPT response. C Google Bard response. The models' responses varied in structure; most include an overview of the topic relevant to the question (A2), a representation of the alternatives (A3), and the answer with its justification (A4). Sometimes the response simply states the answer (B); sometimes the answer is generated with an explanation (C)
Fig. 2
Response sample presenting a "hallucination" generated by GPT-4 on a JRBE 2022 question. A Question manually typed into a prompt and the response; A1: English version of the question. The response is structured in the same way as that in Fig. 1: B overview of the topic related to the question; C representation of the provided alternatives; D answer and its justification. In this response, a wrong answer and its justification are presented in a confident, convincing tone, which is called "hallucination"

