Evaluating Large Language Models for Burning Mouth Syndrome Diagnosis
- PMID: 40124539
- PMCID: PMC11930279
- DOI: 10.2147/JPR.S509845
Abstract
Introduction: Large language models have been proposed as diagnostic aids across various medical fields, including dentistry. Burning mouth syndrome, characterized by burning sensations in the oral cavity without an identifiable cause, poses diagnostic challenges. This study examines the diagnostic accuracy of large language models in identifying burning mouth syndrome, under the hypothesis that their performance would show limitations.
Materials and methods: One hundred synthesized clinical vignettes of burning mouth syndrome were evaluated using three large language models (ChatGPT-4o, Gemini Advanced 1.5 Pro, and Claude 3.5 Sonnet). Each vignette included patient demographics, symptoms, and medical history. The large language models were prompted to provide a primary diagnosis, differential diagnoses, and their reasoning. Accuracy was determined by comparing the models' responses with expert evaluations.
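The abstract does not include the authors' actual prompts or scoring code. Purely as an illustration of the workflow it describes, the Python sketch below loops synthesized vignettes through three models via a placeholder `query_model` function; the vignette fields, prompt wording, and all function names are assumptions, not the authors' materials.

```python
# Hypothetical sketch of the vignette-evaluation loop described in the methods.
# `query_model` stands in for whatever chat interface each model exposes; the
# authors' actual prompts, parsing, and scoring are not specified in the abstract.
from dataclasses import dataclass


@dataclass
class Vignette:
    case_id: int
    demographics: str
    symptoms: str
    history: str


PROMPT_TEMPLATE = (
    "Patient demographics: {demographics}\n"
    "Symptoms: {symptoms}\n"
    "Medical history: {history}\n\n"
    "Provide a primary diagnosis, differential diagnoses, and your reasoning."
)


def query_model(model_name: str, prompt: str) -> str:
    """Placeholder for a call to ChatGPT-4o, Gemini Advanced 1.5 Pro, or Claude 3.5 Sonnet."""
    raise NotImplementedError


def collect_responses(vignettes: list[Vignette], models: list[str]) -> dict[str, list[str]]:
    """Gather each model's free-text response for every vignette for later expert review."""
    responses: dict[str, list[str]] = {m: [] for m in models}
    for v in vignettes:
        prompt = PROMPT_TEMPLATE.format(
            demographics=v.demographics, symptoms=v.symptoms, history=v.history
        )
        for m in models:
            responses[m].append(query_model(m, prompt))
    return responses
```

In the study itself, the collected responses were judged against expert evaluations rather than scored automatically; the sketch stops at response collection for that reason.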
Results: ChatGPT and Claude each achieved an accuracy rate of 99%, whereas Gemini's accuracy was 89% (p < 0.001). Misdiagnoses included persistent idiopathic facial pain and combined diagnoses that included inappropriate conditions. The large language models also differed in their reasoning patterns and requests for additional data.
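The abstract reports only the accuracy percentages and p < 0.001, not the statistical test used. The short sketch below shows one plausible way such a comparison could be computed from the correct/incorrect counts (a chi-square test across the three models, with a pairwise Fisher's exact test as follow-up); it is an illustration, not the authors' actual analysis.

```python
# Hedged reconstruction of the accuracy comparison (99%, 99%, 89% over 100 cases each).
from scipy.stats import chi2_contingency, fisher_exact

# Rows: ChatGPT-4o, Claude 3.5 Sonnet, Gemini Advanced 1.5 Pro
# Columns: correct, incorrect diagnoses out of 100 vignettes
counts = [
    [99, 1],
    [99, 1],
    [89, 11],
]

# Overall test of whether accuracy differs across the three models
chi2, p_overall, dof, _ = chi2_contingency(counts)
print(f"3x2 chi-square: chi2={chi2:.2f}, dof={dof}, p={p_overall:.4g}")

# Example pairwise comparison: ChatGPT-4o vs Gemini
_, p_pair = fisher_exact([counts[0], counts[2]])
print(f"ChatGPT-4o vs Gemini (Fisher's exact): p={p_pair:.4g}")
```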
Discussion: Despite high overall accuracy, the models exhibited variations in reasoning approaches and occasional errors, underscoring the importance of clinician oversight. Limitations include the synthesized nature of vignettes, potential over-reliance on exclusionary criteria, and challenges in differentiating overlapping disorders.
Conclusion: Large language models demonstrate strong potential as supplementary diagnostic tools for burning mouth syndrome, especially in settings lacking specialist expertise. However, their reliability depends on thorough patient assessment and expert verification. Integrating large language models into routine diagnostics could enhance early detection and management, ultimately improving clinical decision-making for dentists and specialists alike.
Keywords: artificial intelligence; burning mouth syndrome; dentistry; diagnostic accuracy; large language models.
© 2025 Suga et al.
Conflict of interest statement
The authors declare that there are no competing interests.