Comparative Performance of Chatbots in Endodontic Clinical Decision Support: A 4-Day Accuracy and Consistency Study

Mine Büker et al. Int Dent J. 2025 Jul 27;75(5):100920. doi: 10.1016/j.identj.2025.100920. Online ahead of print.
Abstract

Introduction and aims: Although artificial intelligence is increasingly used in healthcare settings, concerns remain regarding its reliability and accuracy. This study assessed the overall, difficulty-level-specific, and day-to-day accuracy and consistency of 5 AI chatbots (ChatGPT-3.5, ChatGPT-4o, Gemini 2.0 Flash, Copilot, and Copilot Pro) in answering clinically relevant endodontic questions.

Methods: Seventy-six correct/incorrect questions were developed by 2 endodontists and categorized by an expert into 3 difficulty levels (Basic [B], Intermediate [I], and Advanced [A]). Twenty questions from each difficulty level were then selected from the 74 validated questions (B, n = 26; I, n = 24; A, n = 24), giving a total of 60 questions. These questions were posed to the chatbots over a period of 4 days, at 3 different times each day (morning, afternoon, and evening).
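As a back-of-the-envelope illustration of the evaluation arithmetic (each chatbot answers the same 60 questions in 12 sessions: 4 days x 3 times per day), the minimal Python sketch below tabulates overall, per-day, and per-difficulty-level accuracy from a hypothetical response log. The data layout and function name are illustrative assumptions, not the authors' analysis code.

    # Minimal sketch (hypothetical, not the authors' code): tabulate chatbot
    # accuracy from a response log matching the study design (60 questions,
    # 4 days, 3 sessions per day, per chatbot).
    from collections import defaultdict

    def accuracy_breakdown(responses):
        """responses: list of dicts with keys 'chatbot', 'day' (1-4),
        'session' ('morning'/'afternoon'/'evening'), 'level' ('B'/'I'/'A'),
        and 'correct' (bool)."""
        overall = defaultdict(lambda: [0, 0])    # chatbot -> [correct, total]
        per_day = defaultdict(lambda: [0, 0])    # (chatbot, day) -> [correct, total]
        per_level = defaultdict(lambda: [0, 0])  # (chatbot, level) -> [correct, total]
        for r in responses:
            for key, bucket in ((r['chatbot'], overall),
                                ((r['chatbot'], r['day']), per_day),
                                ((r['chatbot'], r['level']), per_level)):
                bucket[key][0] += r['correct']
                bucket[key][1] += 1
        pct = lambda c, t: 100.0 * c / t if t else 0.0
        return ({k: pct(*v) for k, v in overall.items()},
                {k: pct(*v) for k, v in per_day.items()},
                {k: pct(*v) for k, v in per_level.items()})

    # Example: a single logged answer
    log = [{'chatbot': 'ChatGPT-4o', 'day': 1, 'session': 'morning',
            'level': 'B', 'correct': True}]
    print(accuracy_breakdown(log)[0])  # {'ChatGPT-4o': 100.0}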

Results: ChatGPT-4o achieved the highest overall accuracy (82.5%) and the best performance in the B-level category (95.0%), while Copilot Pro had the lowest overall accuracy (74.03%). Gemini and ChatGPT-3.5 showed similar overall accuracy. Gemini's accuracy improved significantly over time, Copilot Pro's accuracy decreased significantly across days, and no significant change was detected in either ChatGPT model or in Copilot. Within individual categories, Copilot Pro showed a significant decrease in B-level accuracy over the days, Copilot showed a significant increase in B- and I-level accuracy, and Gemini showed a significant increase in A-level accuracy.

Conclusions: ChatGPT-4o demonstrated superior performance, whereas Copilot and Copilot Pro showed insufficient accuracy. ChatGPT-3.5 and Gemini may be acceptable for general queries but require caution in more advanced cases.

Clinical relevance: ChatGPT-4o demonstrated the highest overall accuracy and consistency across all question categories over the 4 days, suggesting its potential as a reliable tool for clinical decision-making.

Keywords: Artificial intelligence; ChatGPT; Clinical decision support; Copilot; Endodontics; Gemini.

Conflict of interest statement

None disclosed.

Figures

Fig. 1. Overall accuracy rate changes of chatbots on different days.

Fig. 2. Accuracy rate changes of chatbots based on difficulty-level category on different days.

