Evaluation of Multiple-Choice Tests in Head and Neck Ultrasound Created by Physicians and Large Language Models

Jacob P S Nielsen¹, August Krogh Mikkelsen¹,², Julian Kuenzel³, Merry E Sebelik⁴,⁵, Gitta Madani⁶, Tsung-Lin Yang⁷,⁸, Tobias Todsen¹,²,⁹

Affiliations

¹ Department of Otorhinolaryngology, Head and Neck Surgery and Audiology, Copenhagen University Hospital (Rigshospitalet), 2100 Copenhagen, Denmark.
² Department of Clinical Medicine, University of Copenhagen, 2100 Copenhagen, Denmark.
³ Department of Otorhinolaryngology, Head and Neck Surgery, University Hospital Regensburg, 93053 Regensburg, Germany.
⁴ Department of Otolaryngology-Head and Neck Surgery, Emory University School of Medicine, Atlanta, GA 30332, USA.
⁵ The Winship Cancer Institute, Emory University, Atlanta, GA 30322, USA.
⁶ Imperial College Healthcare NHS Trust, London W6 8RF, UK.
⁷ Department of Otolaryngology, National Taiwan University Hospital, Taipei 100225, Taiwan.
⁸ Graduate Institute of Clinical Medicine, National Taiwan University College of Medicine, Taipei 100233, Taiwan.
⁹ CAMES-Copenhagen Academy for Medical Education and Simulation, Capital Region of Denmark, 2100 Copenhagen, Denmark.

PMID: 40804813
PMCID: PMC12346108
DOI: 10.3390/diagnostics15151848

Jacob P S Nielsen et al. Diagnostics (Basel). 2025 Jul 22;15(15):1848. doi: 10.3390/diagnostics15151848.

Abstract

Background/Objectives: Otolaryngologists are increasingly using head and neck ultrasound (HNUS). Determining whether a practitioner of HNUS has achieved adequate theoretical knowledge remains a challenge. This study assesses the performance of two large language models (LLMs) in generating multiple-choice questions (MCQs) for HNUS, compared with MCQs generated by physicians.

Methods: Physicians and two LLMs, ChatGPT (GPT-4o) and Google Gemini (Gemini Advanced), created a total of 90 MCQs covering the topics of lymph nodes, thyroid, and salivary glands. Experts in HNUS additionally evaluated all physician-drafted MCQs using a Delphi-like process. An international panel of HNUS experts, blinded to the source of the questions, then assessed the MCQs. The evaluation used a Likert scale and was based on an overall assessment comprising six criteria: clarity, relevance, suitability, quality of distractors, adequate rationale of the answer, and level of difficulty.

Results: Four experts in the clinical field of HNUS assessed the 90 MCQs. No significant differences were observed between the two LLMs. Physician-drafted questions (n = 30) differed significantly from Google Gemini questions in relevance, suitability, and adequate rationale of the answer, but differed significantly from ChatGPT questions only in suitability. Compared with MCQ items (n = 16) validated by medical experts, LLM-constructed MCQ items scored significantly lower across all criteria. The difficulty level of the MCQs was the same.

Conclusions: Our study demonstrates that both LLMs could be used to generate MCQ items with a quality comparable to drafts from physicians. However, the quality of LLM-generated MCQ items was still significantly lower than that of MCQs validated by ultrasound experts. LLMs are therefore a cost-effective way to generate a quick draft of MCQ items, which should then be validated by experts before being used for assessment purposes. In this way, the value of LLMs lies not in eliminating humans but in vastly superior time management.

Keywords: AI; LLM; head and neck; learning; multiple-choice quiz; ultrasound.

Conflict of interest statement

The authors declare no conflicts of interest.

Figures

**Figure 2**
Boxplot illustrating the overall median, interquartile range, minimum, and maximum values for the assessment criteria across the evaluated entities: Google Gemini, ChatGPT, and Physicians Draft.
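
The summary statistics named in this caption (median, interquartile range, minimum, and maximum) can be computed per question source roughly as in the sketch below. This is an illustration only, not the authors' analysis code: the function name `five_number_summary`, the 1 to 5 Likert range, the quartile method, and all rating values are assumptions made for the example and are not taken from the study.

```python
from statistics import median, quantiles


def five_number_summary(scores):
    """Return min, Q1, median, Q3 and max for a list of Likert ratings."""
    ordered = sorted(scores)
    # "inclusive" treats the ratings as the full population; this choice of
    # quartile method is an assumption, the paper may have used another one.
    q1, _, q3 = quantiles(ordered, n=4, method="inclusive")
    return {
        "min": ordered[0],
        "q1": q1,
        "median": median(ordered),
        "q3": q3,
        "max": ordered[-1],
    }


# Hypothetical ratings on an assumed 1-5 Likert scale.
# These numbers are placeholders, NOT data from the study.
ratings = {
    "Google Gemini": [3, 4, 4, 2, 5, 3, 4],
    "ChatGPT": [4, 4, 3, 5, 4, 3, 5],
    "Physicians Draft": [5, 4, 4, 5, 3, 4, 5],
}

summaries = {source: five_number_summary(vals) for source, vals in ratings.items()}

for source, stats in summaries.items():
    iqr = stats["q3"] - stats["q1"]
    print(f"{source}: median={stats['median']}, IQR={iqr}, "
          f"range={stats['min']}-{stats['max']}")
```

Each dictionary holds the five values a boxplot draws for one group: the box spans Q1 to Q3, the line inside it marks the median, and the whiskers reach the minimum and maximum.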
