Large language model-generated clinical practice guideline for appendicitis
- PMID: 40251310
- DOI: 10.1007/s00464-025-11723-3
Abstract
Background: Clinical practice guidelines provide important evidence-based recommendations to optimize patient care, but their development is labor-intensive and time-consuming. Large language models (LLMs) have shown promise in supporting academic writing and the development of systematic reviews, but their ability to assist with guideline development has not been explored. In this study, we tested the capacity of LLMs to support each stage of guideline development, using the latest Society of American Gastrointestinal and Endoscopic Surgeons (SAGES) guideline on the surgical management of appendicitis as a comparison.
Methods: Prompts were engineered to trigger LLMs to perform each task of guideline development, using key questions and PICOs derived from the SAGES guideline. ChatGPT-4, Google Gemini, Consensus, and Perplexity were queried on February 21, 2024. LLM performance was evaluated qualitatively, with narrative descriptions of each task's output. The Appraisal of Guidelines for Research and Evaluation in Surgery (AGREE-S) instrument was used to quantitatively assess the quality of the LLM-derived guideline compared to the existing SAGES guideline.
Results: Popular LLMs were able to generate a search syntax, perform data analysis, and follow the GRADE approach and Evidence-to-Decision framework to produce guideline recommendations. These LLMs were unable to independently perform a systematic literature search or reliably perform screening, data extraction, or risk of bias assessment at the time of testing. AGREE-S appraisal produced a total score of 119 for the LLM-derived guideline and 156 for the SAGES guideline. In 19 of the 24 domains, the two guidelines scored within two points of each other.
Conclusions: LLMs demonstrate potential to assist with certain steps of guideline development, which may reduce time and resource burden associated with these tasks. As new models are developed, the role for LLMs in guideline development will continue to evolve. Ongoing research and multidisciplinary collaboration are needed to support the safe and effective integration of LLMs in each step of guideline development.
Keywords: Appendicitis; ChatGPT; Clinical practice guideline; Generative AI; Large language models; Surgery.
© 2025. The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
Conflict of interest statement
Declarations. Disclosures: Dr. Bethany J. Slater is a consultant for Hologic and is the Chair of the Guidelines Committee for the Society of American Gastrointestinal and Endoscopic Surgeons (SAGES). Dr. Patricia Sylla is a consultant for Ethicon, Stryker, Safeheal and Tissium. Dr. Danielle S. Walsh is a Member of the American College of Surgeons Health Information Technology Committee and Board of Governors, and a Member of the American Academy of Pediatrics Surgical Section Executive Committee. Ms. Amy Boyle, Dr. Bright Huo, Dr. Elisa Calabrese, Dr. Sunjay Kumar, and Dr. Wesley Vosburg have no conflicts of interest to disclose. Ethical approval: Not applicable. Patient consent statement: Not applicable.