Generative pretrained transformer-4, an artificial intelligence text predictive model, has a high capability for passing novel written radiology exam questions

Avnish Sood et al. Int J Comput Assist Radiol Surg. 2024 Apr;19(4):645-653. doi: 10.1007/s11548-024-03071-9. Epub 2024 Feb 21.

Abstract

Purpose: AI image interpretation with convolutional neural networks shows increasing capability within radiology. These models have achieved impressive performance on specific tasks in controlled settings, but they have inherent limitations, such as the inability to consider clinical context. We assess the ability of large language models (LLMs) to answer radiology specialty exam questions, in order to determine whether they can evaluate relevant clinical information.

Methods: A database of questions was created from official sample, author-written, and textbook questions based on the Royal College of Radiologists (United Kingdom) FRCR 2A and American Board of Radiology (ABR) Certifying examinations. The questions were input into Generative Pretrained Transformer (GPT) versions 3 and 4, with prompting instructing the model to answer them.
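
As a rough illustration of the prompting step described in the Methods, the sketch below shows how a single-best-answer question might be submitted to GPT-4 through the OpenAI Python client. The model name, prompt wording, helper function, and example question are assumptions made for illustration; the paper does not specify the authors' exact prompts or API settings.

# Hypothetical sketch: sending one single-best-answer (SBA) exam question to GPT-4.
# Model name, prompt wording, and the example question are illustrative assumptions,
# not the authors' exact protocol.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_sba(stem: str, options: dict[str, str], model: str = "gpt-4") -> str:
    """Return the option letter the model selects for one SBA question."""
    option_text = "\n".join(f"{letter}. {text}" for letter, text in options.items())
    prompt = (
        "You are sitting a written radiology examination. "
        "Choose the single best answer and reply with the option letter only.\n\n"
        f"{stem}\n\n{option_text}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output when scoring a fixed question bank
    )
    return response.choices[0].message.content.strip()

# Example usage with a made-up question
answer = ask_sba(
    "Which imaging modality is first-line for suspected acute cholecystitis?",
    {"A": "Ultrasound", "B": "CT", "C": "MRI", "D": "Plain radiograph", "E": "PET-CT"},
)
print(answer)  # e.g. "A"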

Results: One thousand seventy-two questions were evaluated with GPT-3 and GPT-4: 495 (46.2%) for the FRCR 2A and 577 (53.8%) for the ABR exam. There were 890 single-best-answer (SBA) questions and 182 true/false questions. GPT-4 answered 629/890 (70.7%) SBA and 151/182 (83.0%) true/false questions correctly, with no degradation on author-written questions. GPT-4 performed significantly better than GPT-3, which selected the correct answer in 282/890 (31.7%) SBA and 111/182 (61.0%) true/false questions. Performance of GPT-4 was similar across both examinations for all categories of question.

Conclusion: The newest generation of LLMs, GPT-4, demonstrates high capability in answering radiology exam questions. It shows a marked improvement over GPT-3, suggesting that further gains in accuracy are possible. Further research is needed to explore the clinical applicability of these AI models in real-world settings.

Keywords: Artificial intelligence; Image interpretation; Large language model; Radiology examination.
