Evidence-based potential of generative artificial intelligence large language models in orthodontics: a comparative study of ChatGPT, Google Bard, and Microsoft Bing
- PMID: 38613510
- DOI: 10.1093/ejo/cjae017
Abstract
Background: The increasing utilization of generative artificial intelligence large language models (LLMs) across various medical and dental fields, and specifically in orthodontics, raises questions about their accuracy.
Objective: This study aimed to assess and compare the answers offered by four LLMs (Google's Bard, OpenAI's ChatGPT-3.5 and ChatGPT-4, and Microsoft's Bing) in response to clinically relevant questions within the field of orthodontics.
Materials and methods: Ten open-type clinical orthodontics-related questions were posed to the LLMs. The responses provided by the LLMs were assessed on a scale ranging from 0 (minimum) to 10 (maximum) points, benchmarked against robust scientific evidence, including consensus statements and systematic reviews, using a predefined rubric. After a 4-week interval from the initial evaluation, the answers were reevaluated to gauge intra-evaluator reliability. Statistical comparisons were conducted on the scores using Friedman's and Wilcoxon's tests to identify the model providing the answers with the most comprehensiveness, scientific accuracy, clarity, and relevance.
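The statistical workflow described above (an omnibus Friedman test across the related score samples, followed by pairwise Wilcoxon signed-rank comparisons) can be sketched as follows. This is an illustrative sketch only: the score arrays are hypothetical, not the study's data.

```python
# Illustrative sketch of the methods' statistics: Friedman's test across
# four related samples of per-question ratings, then a pairwise Wilcoxon
# signed-rank follow-up. Scores are hypothetical 0-10 ratings, one per
# question for each model (10 questions, as in the study design).
from scipy.stats import friedmanchisquare, wilcoxon

bing  = [8, 7, 6, 8, 7, 7, 6, 8, 7, 7]
gpt4  = [5, 4, 5, 6, 4, 5, 4, 5, 5, 4]
bard  = [5, 4, 4, 5, 5, 4, 5, 4, 5, 5]
gpt35 = [4, 3, 4, 4, 3, 4, 4, 4, 3, 5]

# Omnibus test: do the four models' rating distributions differ?
stat, p = friedmanchisquare(bing, gpt4, bard, gpt35)
print(f"Friedman: chi2={stat:.2f}, p={p:.4f}")

# Pairwise follow-up, e.g. Bing vs ChatGPT-3.5.
w, p_pair = wilcoxon(bing, gpt35)
print(f"Wilcoxon Bing vs ChatGPT-3.5: W={w:.1f}, p={p_pair:.4f}")
```

With ratings this clearly separated, both tests reject the null hypothesis at the 0.05 level.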
Results: Overall, no statistically significant differences were detected between the scores given by the two evaluators on either scoring occasion, so an average score was computed for every LLM. Microsoft Bing Chat's answers scored highest (average score = 7.1), followed by ChatGPT-4 (average score = 4.7), Google Bard (average score = 4.6), and finally ChatGPT-3.5 (average score = 3.8). Microsoft Bing Chat statistically outperformed both ChatGPT-3.5 (P-value = 0.017) and Google Bard (P-value = 0.029), and ChatGPT-4 outperformed ChatGPT-3.5 (P-value = 0.011); nevertheless, all models occasionally produced answers lacking comprehensiveness, scientific accuracy, clarity, and relevance.
Limitations: The questions asked were indicative and did not cover the entire field of orthodontics.
Conclusions: Large language models (LLMs) show great potential in supporting evidence-based orthodontics. However, their current limitations pose a potential risk of making incorrect healthcare decisions if utilized without careful consideration. Consequently, these tools cannot serve as a substitute for the orthodontist's essential critical thinking and comprehensive subject knowledge. For effective integration into practice, further research, clinical validation, and enhancements to the models are essential. Clinicians must be mindful of the limitations of LLMs, as their imprudent utilization could have adverse effects on patient care.
Keywords: ChatGPT; Google Bard; Microsoft Bing Chat; large language models; orthodontics.
© The Author(s) 2024. Published by Oxford University Press on behalf of the European Orthodontic Society.
Similar articles
- Evaluating the evidence-based potential of six large language models in paediatric dentistry: a comparative study on generative artificial intelligence. Eur Arch Paediatr Dent. 2025 Jun;26(3):527-535. doi: 10.1007/s40368-025-01012-x. Epub 2025 Feb 22. PMID: 39987420. Free PMC article.
- Evaluation of the Performance of Generative AI Large Language Models ChatGPT, Google Bard, and Microsoft Bing Chat in Supporting Evidence-Based Dentistry: Comparative Mixed Methods Study. J Med Internet Res. 2023 Dec 28;25:e51580. doi: 10.2196/51580. PMID: 38009003. Free PMC article.
- Large Language Models in Hematology Case Solving: A Comparative Study of ChatGPT-3.5, Google Bard, and Microsoft Bing. Cureus. 2023 Aug 21;15(8):e43861. doi: 10.7759/cureus.43861. PMID: 37736448. Free PMC article.
- Harnessing artificial intelligence in bariatric surgery: comparative analysis of ChatGPT-4, Bing, and Bard in generating clinician-level bariatric surgery recommendations. Surg Obes Relat Dis. 2024 Jul;20(7):603-608. doi: 10.1016/j.soard.2024.03.011. Epub 2024 Mar 24. PMID: 38644078. Review.
- Performance of artificial intelligence in bariatric surgery: comparative analysis of ChatGPT-4, Bing, and Bard in the American Society for Metabolic and Bariatric Surgery textbook of bariatric surgery questions. Surg Obes Relat Dis. 2024 Jul;20(7):609-613. doi: 10.1016/j.soard.2024.04.014. Epub 2024 May 8. PMID: 38782611. Review.
Cited by
- Evaluating the evidence-based potential of six large language models in paediatric dentistry: a comparative study on generative artificial intelligence. Eur Arch Paediatr Dent. 2025 Jun;26(3):527-535. doi: 10.1007/s40368-025-01012-x. Epub 2025 Feb 22. PMID: 39987420. Free PMC article.
- PICOT questions and search strategies formulation: A novel approach using artificial intelligence automation. J Nurs Scholarsh. 2025 Jan;57(1):5-16. doi: 10.1111/jnu.13036. Epub 2024 Nov 24. PMID: 39582233. Free PMC article.
- Comparing orthodontic pre-treatment information provided by large language models. BMC Oral Health. 2025 May 28;25(1):838. doi: 10.1186/s12903-025-06246-1. PMID: 40437500. Free PMC article.
- Comparative Performance of Chatbots in Endodontic Clinical Decision Support: A 4-Day Accuracy and Consistency Study. Int Dent J. 2025 Jul 27;75(5):100920. doi: 10.1016/j.identj.2025.100920. Online ahead of print. PMID: 40720933. Free PMC article.
- Evaluating the influence of prompt formulation on the reliability and repeatability of ChatGPT in implant-supported prostheses. PLoS One. 2025 May 30;20(5):e0323086. doi: 10.1371/journal.pone.0323086. PMID: 40445924. Free PMC article.