Leveraging multimodal large language model chatbots in oral radiology: a comprehensive evaluation using questions from a Korean dental university
- PMID: 41386253
- DOI: 10.1093/dmfr/twaf083
Abstract
Objectives: This study aimed to conduct a comprehensive evaluation of general-purpose multimodal large language model (LLM) chatbots in oral radiology.
Methods: Ninety text- and image-based oral radiology questions from a Korean dental university were extracted and categorized into six educational content areas and two question types. ChatGPT-4o and Gemini 2.0 Flash were evaluated on the following items: accuracy, with group differences across the six content areas (Fisher's exact test with Bonferroni correction, p < 0.0167); answer consistency across ten repeated outputs (mean agreement and Fleiss' kappa coefficient); and hallucination (mean of the 5-point Global Quality Score assigned by two oral radiologists).
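The consistency and group-difference statistics named in the Methods can be illustrated with a minimal Python sketch. This is not the authors' code. The abstract does not define "mean agreement", so the sketch assumes it is the share of repeated outputs matching each question's most frequent answer, averaged over questions. The function names and the toy inputs in the tests are hypothetical. The Fisher routine covers only the 2x2 case, and the stated threshold of 0.0167 is consistent with 0.05 divided by three comparisons, although the abstract does not say which comparisons were made.

```python
from collections import Counter
from math import comb


def mean_agreement(runs):
    """Assumed definition: per question, the share of repeated outputs that
    match the most frequent answer, averaged over all questions."""
    shares = [Counter(answers).most_common(1)[0][1] / len(answers)
              for answers in runs]
    return sum(shares) / len(shares)


def fleiss_kappa(runs):
    """Fleiss' kappa, treating each repeated output as one 'rater'.
    runs: one list of answers per question, all lists the same length."""
    n_questions = len(runs)
    n_repeats = len(runs[0])
    categories = sorted({a for answers in runs for a in answers})
    counts = [[Counter(answers)[c] for c in categories] for answers in runs]
    # Observed agreement per question, then averaged.
    p_i = [(sum(x * x for x in row) - n_repeats) / (n_repeats * (n_repeats - 1))
           for row in counts]
    p_bar = sum(p_i) / n_questions
    # Chance agreement from the overall share of each answer category.
    p_j = [sum(row[j] for row in counts) / (n_questions * n_repeats)
           for j in range(len(categories))]
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        return float("nan")  # kappa is undefined when only one category occurs
    return (p_bar - p_e) / (1 - p_e)


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]]:
    sums the probabilities of all tables no more likely than the observed one."""
    row1, row2, col1, total = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


# Bonferroni-adjusted threshold, assuming three pairwise comparisons.
ALPHA_ADJUSTED = 0.05 / 3
```

With ten repeated outputs per question, each inner list passed to `fleiss_kappa` would hold ten answer labels. A pairwise comparison of correct and incorrect counts between two content areas would be judged significant only if `fisher_exact_2x2` returned a value below `ALPHA_ADJUSTED`.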
Results: Multimodal AI chatbots (ChatGPT-4o and Gemini 2.0 Flash) achieved excellent performance on text-based questions, with over 80% accuracy, but showed limited performance on image-based tasks, with accuracy under 30%. Additionally, image-based tasks exhibited high response variability, and hallucinations that conveyed incorrect information were frequently observed. These findings suggest that AI chatbots are not yet suitable for reliable use in oral radiology.
Conclusions: This study provided timely insights into the capabilities and limitations of general-purpose multimodal LLM chatbots in oral radiology and will serve as a foundation for safer and more effective applications of AI chatbots in the field.
Advances in knowledge: This is the first study to comprehensively assess multimodal LLM chatbots in oral radiology. It provides key insights into the performance benchmarks for AI chatbots in oral radiology, promoting the responsible and transparent integration of AI into dental education.
Keywords: Accuracy; Answer consistency; Hallucination; Multimodal large language model; Oral radiology.
© The Author(s) 2025. Published by Oxford University Press on behalf of the British Institute of Radiology and the International Association of Dentomaxillofacial Radiology. All rights reserved. For commercial re-use, please contact reprints@oup.com for reprints and translation rights for reprints. All other permissions can be obtained through our RightsLink service via the Permissions link on the article page on our site—for further information please contact journals.permissions@oup.com.
