Development of Prompt Templates for Large Language Model-Driven Screening in Systematic Reviews

Christian Cao¹, Jason Sang², Rohit Arora³, David Chen⁴, Robert Kloosterman⁴, Matthew Cecere⁴, Jaswanth Gorla⁴, Richard Saleh⁴, Ian Drennan⁵, Bijan Teja⁶, Michael Fehlings⁷, Paul Ronksley⁸, Alexander A Leung⁹, Dany E Weisz¹⁰, Harriet Ware¹¹, Mairead Whelan¹¹, David B Emerson¹², Rahul K Arora¹¹, Niklas Bobrovitz¹³

Affiliations

¹ Temerty Faculty of Medicine, University of Toronto, Toronto, Ontario, and Centre for Health Informatics, Department of Community Health Sciences, University of Calgary, Calgary, Alberta, Canada (C.C.).
² Stripe, San Francisco, California (J.S.).
³ Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts (R.A.).
⁴ Temerty Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada (D.C., R.K., M.C., J.G., R.S.).
⁵ Temerty Faculty of Medicine, University of Toronto, Department of Emergency Services and Sunnybrook Research Institute, Sunnybrook Health Sciences Centre, and Ornge Air Ambulance and Critical Care Transport, Toronto, Ontario, Canada (I.D.).
⁶ Department of Anesthesiology and Pain Medicine, University of Toronto, and Department of Anesthesia and Critical Care Medicine, St. Michael's Hospital, Toronto, Ontario, Canada (B.T.).
⁷ Department of Surgery, Temerty Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada (M.F.).
⁸ Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada (P.R.).
⁹ Department of Medicine and Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada (A.A.L.).
¹⁰ Department of Newborn and Developmental Paediatrics, Sunnybrook Health Sciences Centre, Toronto, Ontario, Canada (D.E.W.).
¹¹ Centre for Health Informatics, Department of Community Health Sciences, University of Calgary, Calgary, Alberta, Canada (H.W., M.W., R.K.A.).
¹² Vector Institute, Toronto, Ontario, Canada (D.B.E.).
¹³ Centre for Health Informatics, Department of Community Health Sciences, and Department of Emergency Medicine, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada (N.B.).

PMID: 39993313
DOI: 10.7326/ANNALS-24-02189

Christian Cao et al. Ann Intern Med. 2025 Mar;178(3):389-401. doi: 10.7326/ANNALS-24-02189. Epub 2025 Feb 25.

Abstract

Background: Systematic reviews (SRs) are hindered by the initial rigorous article screen, which delays access to reliable information synthesis.

Objective: To develop generic prompt templates for large language model (LLM)-driven abstract and full-text screening that can be adapted to different reviews.

Design: Diagnostic test accuracy.

Setting: 48 425 citations were tested for abstract screening across 10 SRs. Full-text screening evaluated all 12 690 freely available articles from the original search. Prompt development used the GPT4-0125-preview model (OpenAI).

Participants: None.

Measurements: Large language models were prompted to include or exclude articles based on SR eligibility criteria. Model outputs were compared with original SR author decisions after full-text screening to evaluate performance (accuracy, sensitivity, and specificity).
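The comparison described above can be sketched as a small confusion-matrix calculation. This is a minimal illustration, not the authors' code: the names `ScreeningCounts`, `tally`, and `metrics` are hypothetical, and the sketch assumes each citation has one boolean LLM decision and one boolean reference decision (the original SR authors' include/exclude call after full-text screening).

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ScreeningCounts:
    """Confusion-matrix counts, with "include" as the positive class."""
    tp: int  # LLM included, authors included
    fp: int  # LLM included, authors excluded
    tn: int  # LLM excluded, authors excluded
    fn: int  # LLM excluded, authors included


def tally(llm_include: Sequence[bool], author_include: Sequence[bool]) -> ScreeningCounts:
    """Count agreements and disagreements between LLM and reference decisions."""
    if len(llm_include) != len(author_include):
        raise ValueError("decision lists must have the same length")
    tp = fp = tn = fn = 0
    for llm, ref in zip(llm_include, author_include):
        if llm and ref:
            tp += 1
        elif llm and not ref:
            fp += 1
        elif not llm and not ref:
            tn += 1
        else:
            fn += 1
    return ScreeningCounts(tp, fp, tn, fn)


def metrics(c: ScreeningCounts) -> dict:
    """Return sensitivity, specificity, and accuracy (NaN if a denominator is zero)."""
    nan = float("nan")
    total = c.tp + c.fp + c.tn + c.fn
    return {
        "sensitivity": c.tp / (c.tp + c.fn) if (c.tp + c.fn) else nan,
        "specificity": c.tn / (c.tn + c.fp) if (c.tn + c.fp) else nan,
        "accuracy": (c.tp + c.tn) / total if total else nan,
    }


# Toy data for illustration only; these are not results from the study.
llm_decisions = [True, True, False, False, False]
author_decisions = [True, False, False, False, True]
toy_counts = tally(llm_decisions, author_decisions)
toy_metrics = metrics(toy_counts)
```

In a screening setting, sensitivity is usually the metric to watch most closely, because a false negative drops an eligible study from the review, whereas a false positive only adds work at the next screening stage.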

Results: Optimized prompts using GPT4-0125-preview achieved a weighted sensitivity of 97.7% (range, 86.7% to 100%) and specificity of 85.2% (range, 68.3% to 95.9%) in abstract screening and weighted sensitivity of 96.5% (range, 89.7% to 100.0%) and specificity of 91.2% (range, 80.7% to 100%) in full-text screening across 10 SRs. In contrast, zero-shot prompts had poor sensitivity (49.0% abstract, 49.1% full-text). Across LLMs, Claude-3.5 (Anthropic) and GPT4 variants had similar performance, whereas Gemini Pro (Google) and GPT3.5 (OpenAI) models underperformed. Direct screening costs for 10 000 citations differed substantially: Where single human abstract screening was estimated to require more than 83 hours and $1666.67 USD, our LLM-based approach completed screening in under 1 day for $157.02 USD.
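The cost comparison above can be checked with simple arithmetic. The abstract does not state the inputs behind the human estimate, so the sketch below assumes 30 seconds per citation and a rate of $20 USD per hour, which is one pair of values consistent with the reported 83 hours and $1666.67 USD. The function name `human_screening_cost` is hypothetical.

```python
CITATIONS = 10_000
REPORTED_HUMAN_COST_USD = 1666.67  # from the abstract
REPORTED_LLM_COST_USD = 157.02     # from the abstract


def human_screening_cost(n_citations: int,
                         seconds_per_citation: float = 30.0,  # assumption, not stated in the abstract
                         hourly_rate_usd: float = 20.0):      # assumption, not stated in the abstract
    """Return (hours, cost in USD) for single-reviewer abstract screening."""
    hours = n_citations * seconds_per_citation / 3600
    return hours, hours * hourly_rate_usd


human_hours, human_cost = human_screening_cost(CITATIONS)

# Per-citation costs and the ratio between the two reported totals.
human_cost_per_citation = REPORTED_HUMAN_COST_USD / CITATIONS
llm_cost_per_citation = REPORTED_LLM_COST_USD / CITATIONS
cost_ratio = REPORTED_HUMAN_COST_USD / REPORTED_LLM_COST_USD

print(f"human: {human_hours:.2f} h, ${human_cost:.2f}")
print(f"LLM cost per citation: ${llm_cost_per_citation:.4f}")
print(f"human/LLM cost ratio: {cost_ratio:.1f}x")
```

Under these assumptions, the human estimate works out to about 83.3 hours, and the reported totals imply that LLM screening costs roughly one tenth as much per citation as single human screening.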

Limitations: Further prompt optimizations may exist. Retrospective study. Convenience sample of SRs. Full-text screening evaluations were limited to free PubMed Central full-text articles.

Conclusion: We developed a generic prompt for abstract and full-text screening that achieved high sensitivity and specificity and can be adapted to other SRs and LLMs. Our prompting innovations may have value to SR investigators and researchers conducting similar criteria-based tasks across the medical sciences.

Primary funding source: None.

Conflict of interest statement

Disclosures: Disclosure forms are available with the article online.
