Large language model-generated clinical practice guideline for appendicitis

Amy Boyle¹, Bright Huo², Patricia Sylla³, Elisa Calabrese⁴, Sunjay Kumar⁵, Bethany J Slater⁶, Danielle S Walsh⁷, R Wesley Vosburg⁸

Affiliations

¹ Michael G. DeGroote School of Medicine, McMaster University, Hamilton, ON, Canada.
² Division of General Surgery, Department of Surgery, McMaster University, Hamilton, ON, Canada.
³ Division of Colon and Rectal Surgery, Department of Surgery, Mount Sinai Hospital, New York, NY, USA.
⁴ Department of Surgery, University of Adelaide, The Queen Elizabeth Hospital, Adelaide, SA, Australia.
⁵ Department of General Surgery, Thomas Jefferson University Hospital, Philadelphia, PA, USA.
⁶ Department of Surgery, University of Chicago, Chicago, IL, USA.
⁷ Professor of Surgery, Department of Surgery, University of Kentucky, Lexington, KY, USA.
⁸ Department of Surgery, Grand Strand Medical Center, Myrtle Beach, SC, USA. ralph.vosburg@hcahealthcare.com.

PMID: 40251310
DOI: 10.1007/s00464-025-11723-3

Amy Boyle et al. Surg Endosc. 2025 Jun;39(6):3539-3551. doi: 10.1007/s00464-025-11723-3. Epub 2025 Apr 18.

Abstract

Background: Clinical practice guidelines provide important evidence-based recommendations to optimize patient care, but their development is labor-intensive and time-consuming. Large language models (LLMs) have shown promise in supporting academic writing and the development of systematic reviews, but their ability to assist with guideline development has not been explored. In this study, we tested the capacity of LLMs to support each stage of guideline development, using the latest Society of American Gastrointestinal and Endoscopic Surgeons (SAGES) guideline on the surgical management of appendicitis as a comparison.

Methods: Prompts were engineered to direct LLMs to perform each task of guideline development, using key questions and PICOs derived from the SAGES guideline. ChatGPT-4, Google Gemini, Consensus, and Perplexity were queried on February 21, 2024. LLM performance was evaluated qualitatively, with narrative descriptions of each task's output. The Appraisal of Guidelines for Research and Evaluation in Surgery (AGREE-S) instrument was used to quantitatively assess the quality of the LLM-derived guideline compared to the existing SAGES guideline.

Results: Popular LLMs were able to generate a search syntax, perform data analysis, and follow the GRADE approach and Evidence-to-Decision framework to produce guideline recommendations. These LLMs were unable to independently perform a systematic literature search or reliably perform screening, data extraction, or risk of bias assessment at the time of testing. AGREE-S appraisal produced a total score of 119 for the LLM-derived guideline and 156 for the SAGES guideline. In 19 of the 24 domains, the two guidelines scored within two points of each other.

Conclusions: LLMs demonstrate potential to assist with certain steps of guideline development, which may reduce time and resource burden associated with these tasks. As new models are developed, the role for LLMs in guideline development will continue to evolve. Ongoing research and multidisciplinary collaboration are needed to support the safe and effective integration of LLMs in each step of guideline development.

Keywords: Appendicitis; ChatGPT; Clinical practice guideline; Generative AI; Large language models; Surgery.

Conflict of interest statement

Declarations. Disclosures: Dr. Bethany J. Slater is a consultant for Hologic and is the Chair of the Guidelines Committee for Society of American Gastrointestinal and Endoscopic Surgeons (SAGES). Dr. Patricia Sylla is a consultant for Ethicon, Stryker, Safeheal and Tissium. Dr. Danielle S. Walsh is a Member of the American College of Surgeons Health Information Technology Committee and Board of Governors. Dr. Danielle S. Walsh is a Member of the American Academy of Pediatrics Surgical Section Executive Committee. Ms. Amy Boyle, Dr. Bright Huo, Dr. Elisa Calabrese, Dr. Sunjay Kumar, and Dr. Wesley Vosburg have no conflicts of interest to disclose. Ethical approval: Not applicable. Patient consent statement: Not applicable.
