J Dent Sci. 2025 Apr;20(2):895-900.

doi: 10.1016/j.jds.2024.08.020. Epub 2024 Sep 11.

Can a large language model create acceptable dental board-style examination questions? A cross-sectional prospective study

Hak-Sun Kim¹, Gyu-Tae Kim²

Affiliations

¹ Department of Oral and Maxillofacial Radiology, Kyung Hee University Dental Hospital, Seoul, Republic of Korea.
² Department of Oral and Maxillofacial Radiology, College of Dentistry, Kyung Hee University, Seoul, Republic of Korea.

PMID: 40224064
PMCID: PMC11993092
DOI: 10.1016/j.jds.2024.08.020

Hak-Sun Kim et al. J Dent Sci. 2025 Apr.

Abstract

Background/purpose: Numerous studies have shown that large language models (LLMs) can score above the passing grade on various board examinations. Therefore, this study aimed to evaluate national dental board-style examination questions created by an LLM versus those created by human experts using item analysis.

Materials and methods: This study was conducted in June 2024 and included senior dental students (n = 30) who participated voluntarily. An LLM, ChatGPT 4o, was used to generate 44 national dental board-style examination questions based on textbook content. Twenty questions for the LLM set were randomly selected after removing false questions. Two experts created another set of 20 questions based on the same content and in the same style as the LLM. Participating students simultaneously answered a total of 40 questions divided into two sets using Google Forms in the classroom. The responses were analyzed to assess difficulty, discrimination index, and distractor efficiency. Statistical comparisons were performed using the Wilcoxon signed rank test or linear-by-linear association test, with a confidence level of 95%.
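The three item-analysis measures named above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the abstract does not state the exact formulas, so the code assumes the conventional definitions (difficulty as percent correct, discrimination as the upper-group minus lower-group proportion correct using 27% groups, and a distractor counted as functioning when at least 5% of examinees choose it). The function names, the 27% and 5% thresholds, and the ten-examinee response data are all hypothetical.

```python
from typing import Sequence


def difficulty_index(correct: Sequence[bool]) -> float:
    """Percentage of examinees who answered the item correctly."""
    return 100.0 * sum(correct) / len(correct)


def discrimination_index(
    correct: Sequence[bool],
    totals: Sequence[float],
    fraction: float = 0.27,  # assumed group size; not stated in the abstract
) -> float:
    """Proportion correct in the top-scoring group minus the bottom-scoring group."""
    k = max(1, round(len(totals) * fraction))
    order = sorted(range(len(totals)), key=lambda i: totals[i])
    lower, upper = order[:k], order[-k:]
    p_upper = sum(correct[i] for i in upper) / k
    p_lower = sum(correct[i] for i in lower) / k
    return p_upper - p_lower


def distractor_efficiency(
    choices: Sequence[str],
    key: str,
    options: Sequence[str],
    threshold: float = 0.05,  # assumed cutoff for a "functioning" distractor
) -> float:
    """Percentage of distractors chosen by at least `threshold` of examinees."""
    distractors = [o for o in options if o != key]
    n = len(choices)
    functioning = sum(
        1 for d in distractors
        if sum(c == d for c in choices) / n >= threshold
    )
    return 100.0 * functioning / len(distractors)


# Hypothetical responses of ten examinees to one five-option item (key = "A"),
# together with each examinee's hypothetical total test score.
choices = ["A", "A", "B", "A", "C", "A", "A", "B", "A", "D"]
totals = [38, 35, 20, 33, 18, 30, 36, 22, 31, 15]
key = "A"
correct = [c == key for c in choices]

print(difficulty_index(correct))
print(discrimination_index(correct, totals))
print(distractor_efficiency(choices, key, "ABCDE"))
```

Computing these values per item for each question set yields the paired distributions that the study then compared statistically.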

Results: The response rate was 100%. The median difficulty indices of the LLM and human sets were 55.00% and 50.00%, respectively, both within the "excellent" range. The median discrimination indices were 0.29 for the LLM set and 0.14 for the human set. Both sets had a median distractor efficiency of 80.00%. None of the differences between the two sets was statistically significant (P > 0.050).

Conclusion: The LLM can create national board-style examination questions of equivalent quality to those created by human experts.

Keywords: Artificial intelligence; Dental education; Examination questions; Natural language processing; Professional competence.

Conflict of interest statement

The author has no conflicts of interest relevant to this article.

Figures

**Figure 1**
Schematic diagram of the overall process of this study. LLM, large language model.

**Figure 2**
Example questions based on knowledge of the biological effects of ionizing radiation. (A) Large language model and (B) human sets.

**Figure 3**
Plots of discrimination indices (Y-axis) against difficulty indices (X-axis). (A) Large language model set and (B) human set.

**Figure 4**
Number of non-functioning distractors in large language model and human sets.
