Clin Oral Investig. 2024 Oct 7;28(11):575.
doi: 10.1007/s00784-024-05968-w

Performance of large language artificial intelligence models on solving restorative dentistry and endodontics student assessments


Paul Künzle et al.

Abstract

Objectives: The advent of artificial intelligence (AI) and large language model (LLM)-based AI applications (LLMAs) has tremendous implications for our society. This study analyzed the performance of LLMAs in solving restorative dentistry and endodontics (RDE) student assessment questions.

Materials and methods: 151 questions from an RDE question pool were prepared for prompting using LLMAs from OpenAI (ChatGPT-3.5, -4.0 and -4.0o) and Google (Gemini 1.0). Multiple-choice questions were sorted into four question subcategories, entered into the LLMAs, and the answers recorded for analysis. Chi-square statistical analyses with corresponding p-values were performed using Python 3.9.16.
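
The abstract names the statistical tooling (chi-square tests run in Python 3.9.16) but does not reproduce the analysis code. A minimal sketch of the kind of pairwise chi-square comparison described, using scipy.stats.chi2_contingency with correct/incorrect counts reconstructed from the reported overall accuracies (151 questions per model), could look as follows; the reconstructed counts and the helper function are illustrative assumptions, not the authors' actual script.

    # Illustrative sketch only: counts are reconstructed from the reported
    # accuracies (72%, 62%, 44%, 25% of 151 questions), not the study's raw data.
    from scipy.stats import chi2_contingency

    N_QUESTIONS = 151
    accuracy = {"GPT-4.0o": 0.72, "GPT-4.0": 0.62, "Gemini 1.0": 0.44, "GPT-3.5": 0.25}

    def pairwise_chi2(model_a, model_b):
        # Build a 2x2 contingency table: rows = models, columns = correct/incorrect.
        table = []
        for model in (model_a, model_b):
            correct = round(accuracy[model] * N_QUESTIONS)
            table.append([correct, N_QUESTIONS - correct])
        chi2, p, dof, expected = chi2_contingency(table)
        return chi2, p

    chi2_stat, p_value = pairwise_chi2("GPT-4.0o", "GPT-3.5")
    print(f"chi2 = {chi2_stat:.2f}, p = {p_value:.4g}")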

Results: Total answer accuracy was highest for ChatGPT-4.0o, followed by ChatGPT-4.0, Gemini 1.0 and ChatGPT-3.5 (72%, 62%, 44% and 25%, respectively), with significant differences between all LLMAs except between the two GPT-4.0 models. Performance was highest on the subcategories direct restorations and caries, followed by indirect restorations and endodontics.

Conclusions: Overall, there are large performance differences among LLMAs. Only the ChatGPT-4 models achieved a success rate high enough to be used, with caution, to support the dental academic curriculum.

Clinical relevance: While LLMAs could support clinicians in answering dental field-related questions, this capacity depends strongly on the model employed. The best-performing model, ChatGPT-4.0o, achieved acceptable accuracy rates in some of the subject subcategories analyzed.

Keywords: Artificial intelligence; ChatGPT; Gemini; GenAI; Natural language processing.

Conflict of interest statement

The authors declare no conflicts of interest regarding either the authorship or the publication of this manuscript.

Figures

Fig. 1
Flowchart for question analysis using four different LLMAs. Questions were selected from an RDE test question pool and subsequently subjected to analysis. Questions were entered into the text field of the respective LLMA, and answers were systematically collected. Afterwards, the results were compiled and statistically interpreted
Fig. 2
Relative answer accuracy of different LLMAs. For each LLMA, the percentage of correct answers is shown by a green bar, while incorrect answers are displayed as red bars. Significant differences between the LLMAs are indicated with different letters (Chi2; p < 0.001 for all significant differences between LLMAs except GPT-3.5 and Gemini 1.0 (p < 0.01)). LLMAs are labelled using short forms of their respective names (GPT-4.0o: OpenAI ChatGPT-4.0o, GPT-4: OpenAI ChatGPT-4.0, GPT-3.5: OpenAI ChatGPT-3.5, Gemini 1.0: Google Gemini 1.0) and sorted by their developer and release date (OpenAI GPT-4.0o: May 13, 2024; GPT-4.0: March 14, 2023; GPT-3.5: November 30, 2022; Google Gemini 1.0: December 15, 2023)
Fig. 3
Relative answer accuracy of different LLMAs on RDE question categories. For each individual subcategory (direct restorations, n = 56; endodontics, n = 25; indirect restorations, n = 12; caries, n = 58), stacked bar charts indicate the relative share of correctly (green) and incorrectly (red) answered test questions. The average performance rate across all LLMAs assessed was highest for the subcategory direct restorations (48%), followed by caries (45%), indirect restorations (44%) and endodontics (31%)
Fig. 4
Sample question prompt types used for the assessment. Quantitative questions were entered into the LLMA using the same prompt, but different answer choices. In style A, the LLMA had to select an answer choice that refers to statements made for the question asked. Style B directly asked for the correct answer choice for the question posed. Correct answers are marked in bold font
Fig. 5
Answer accuracy sorted by question type in subcategory direct restorations. In the subcategory direct restorations, n = 50 questions of type A and n = 6 questions of type B were compared across different LLMAs
Fig. 6
Examples of verbatim question prompts used in LLMAs. The questions used in prompts were kept in German as the original language for the assessment. The prompt for the application was phrased in English.
