Surg Endosc. 2025 Jun;39(6):3539-3551.
doi: 10.1007/s00464-025-11723-3. Epub 2025 Apr 18.

Large language model-generated clinical practice guideline for appendicitis

Amy Boyle et al. Surg Endosc. 2025 Jun.

Abstract

Background: Clinical practice guidelines provide important evidence-based recommendations to optimize patient care, but their development is labor-intensive and time-consuming. Large language models (LLMs) have shown promise in supporting academic writing and the development of systematic reviews, but their ability to assist with guideline development has not been explored. In this study, we tested the capacity of LLMs to support each stage of guideline development, using the latest SAGES guideline on the surgical management of appendicitis as a comparison.

Methods: Prompts were engineered to trigger LLMs to perform each task of guideline development, using key questions and PICOs derived from the SAGES guideline. ChatGPT-4, Google Gemini, Consensus, and Perplexity were queried on February 21, 2024. LLM performance was evaluated qualitatively, with narrative descriptions of each task's output. The Appraisal of Guidelines for Research and Evaluation in Surgery (AGREE-S) instrument was used to quantitatively assess the quality of the LLM-derived guideline compared to the existing SAGES guideline.

Results: Popular LLMs were able to generate a search syntax, perform data analysis, and follow the GRADE approach and Evidence-to-Decision framework to produce guideline recommendations. These LLMs were unable to independently perform a systematic literature search or reliably perform screening, data extraction, or risk of bias assessment at the time of testing. AGREE-S appraisal produced a total score of 119 for the LLM-derived guideline and 156 for the SAGES guideline. In 19 of the 24 domains, the two guidelines scored within two points of each other.

Conclusions: LLMs demonstrate potential to assist with certain steps of guideline development, which may reduce time and resource burden associated with these tasks. As new models are developed, the role for LLMs in guideline development will continue to evolve. Ongoing research and multidisciplinary collaboration are needed to support the safe and effective integration of LLMs in each step of guideline development.

Keywords: Appendicitis; ChatGPT; Clinical practice guideline; Generative AI; Large language models; Surgery.

Conflict of interest statement

Declarations. Disclosures: Dr. Bethany J. Slater is a consultant for Hologic and is the Chair of the Guidelines Committee for the Society of American Gastrointestinal and Endoscopic Surgeons (SAGES). Dr. Patricia Sylla is a consultant for Ethicon, Stryker, Safeheal and Tissium. Dr. Danielle S. Walsh is a Member of the American College of Surgeons Health Information Technology Committee and Board of Governors, and a Member of the American Academy of Pediatrics Surgical Section Executive Committee. Ms. Amy Boyle, Dr. Bright Huo, Dr. Elisa Calabrese, Dr. Sunjay Kumar, and Dr. Wesley Vosburg have no conflicts of interest to disclose. Ethical approval: Not applicable. Patient consent statement: Not applicable.
