To trust or not to trust: evaluating the reliability and safety of AI responses to laryngeal cancer queries
- PMID: 38652298
- PMCID: PMC11512842
- DOI: 10.1007/s00405-024-08643-8
Abstract
Purpose: As online health information-seeking surges, concerns mount over the quality and safety of accessible content, which can lead to patient harm through misinformation. On one hand, the emergence of Artificial Intelligence (AI) in healthcare could help counter such misinformation; on the other hand, questions arise regarding the quality and safety of the medical information AI provides. As laryngeal cancer is a prevalent head and neck malignancy, this study aims to evaluate the utility and safety of three large language models (LLMs) as sources of patient information about laryngeal cancer.
Methods: A cross-sectional study was conducted using three LLMs (ChatGPT 3.5, ChatGPT 4.0, and Bard). A questionnaire comprising 36 inquiries about laryngeal cancer was categorised into diagnosis (11 questions), treatment (9 questions), novelties and upcoming treatments (4 questions), controversies (8 questions), and sources of information (4 questions). The reviewers comprised three groups: ENT specialists, junior physicians, and non-medical reviewers, who graded the responses. Each physician evaluated each question twice per model, while non-medical reviewers evaluated each question once. All reviewers were blinded to the model type, and the question order was shuffled. Outcome evaluations were based on a safety score (1-3) and a Global Quality Score (GQS, 1-5). Results were compared between LLMs. The study included iterative assessments and statistical validations.
Results: Analysis revealed that ChatGPT 3.5 scored highest in both safety (mean: 2.70) and GQS (mean: 3.95). ChatGPT 4.0 and Bard had lower safety scores of 2.56 and 2.42, respectively, with corresponding quality scores of 3.65 and 3.38. Inter-rater reliability was consistent, with less than 3% discrepancy. About 4.2% of responses fell into the lowest safety category (1), particularly in the novelty category. Non-medical reviewers' quality assessments correlated moderately (r = 0.67) with response length.
Conclusions: LLMs can be valuable resources for patients seeking information on laryngeal cancer. ChatGPT 3.5 provided the most reliable and safe responses among the models evaluated.
Keywords: Artificial intelligence; Bard; ChatGPT; Laryngeal cancer; Oncology; Patient education.
© 2024. The Author(s).
Conflict of interest statement
The authors have no conflict of interest.