BMJ Ment Health. 2025 Jul 22;28(1):e301762.
doi: 10.1136/bmjment-2025-301762.

Development and evaluation of prompts for a large language model to screen titles and abstracts in a living systematic review


Ava Homiar et al. BMJ Ment Health.

Abstract

Background: Living systematic reviews (LSRs) maintain an updated summary of evidence by incorporating newly published research. While they improve review currency, the repeated screening and selection of new references make them laborious and difficult to maintain. Large language models (LLMs) show promise in assisting with screening and data extraction, but more work is needed to achieve the high accuracy required for evidence that informs clinical and policy decisions.

Objective: The study evaluated the effectiveness of an LLM (GPT-4o) in title and abstract screening compared with human reviewers.

Methods: Human decisions from an LSR on prodopaminergic interventions for anhedonia served as the reference standard. The baseline search results were divided into a development and a test set. Prompts guiding the LLM's eligibility assessments were refined using the development set and evaluated on the test set and two subsequent LSR updates. Consistency of the LLM outputs was also assessed.
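As a rough illustration of this kind of pipeline (not the authors' actual prompts or code), a title/abstract screening step can be sketched as building an eligibility prompt from each record and mapping the model's reply to an include/exclude decision. The prompt wording, record fields, and parsing rule below are all hypothetical:

```python
# Hypothetical sketch of one title/abstract screening step. The study's
# real prompts were iteratively refined on a development set; this only
# shows the general shape of prompt construction and decision parsing.

PROMPT_TEMPLATE = """You are screening records for a living systematic review
on prodopaminergic interventions for anhedonia.

Title: {title}
Abstract: {abstract}

Answer with exactly one word: INCLUDE or EXCLUDE."""

def build_prompt(record: dict) -> str:
    """Fill the screening prompt with a record's title and abstract."""
    return PROMPT_TEMPLATE.format(title=record["title"],
                                  abstract=record["abstract"])

def parse_decision(model_output: str) -> bool:
    """Map the model's reply to include (True) / exclude (False).

    Anything that is not clearly EXCLUDE is kept, erring on the side of
    sensitivity so that eligible studies are not lost before full-text review.
    """
    return not model_output.strip().upper().startswith("EXCLUDE")

# Example record (invented for illustration):
record = {"title": "An RCT of pramipexole for anhedonia",
          "abstract": "We randomised 120 participants..."}
prompt = build_prompt(record)
decision = parse_decision("INCLUDE")
```

In a real LSR workflow this prompt would be sent to the model's API for each new record retrieved by the update searches, with only the records the model keeps passed on to human reviewers.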

Results: Prompt development required 1045 records. When applied to the remaining baseline 11 939 records and two updates, the refined prompts achieved 100% sensitivity for studies ultimately included in the review after full-text screening, though sensitivity for records included by humans at the title and abstract stage varied (58-100%) across updates. Simulated workload reductions of 65-85% were observed. Prompt decisions showed high consistency, with minimal false exclusions, satisfying established screening performance benchmarks for systematic reviews.
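For reference, the sensitivity and simulated workload-reduction figures reported above follow from ordinary confusion-matrix counts, where workload reduction is the fraction of records the LLM excludes and humans therefore never screen. A small worked example with invented counts (not the study's data):

```python
def screening_metrics(tp: int, fp: int, tn: int, fn: int):
    """Sensitivity, specificity, and simulated workload reduction for
    title/abstract screening against human decisions as the reference."""
    total = tp + fp + tn + fn
    sensitivity = tp / (tp + fn)          # included studies correctly kept
    specificity = tn / (tn + fp)          # irrelevant records correctly dropped
    workload_reduction = (tn + fn) / total  # records humans never need to read
    return sensitivity, specificity, workload_reduction

# Invented counts for illustration only:
sens, spec, wr = screening_metrics(tp=50, fp=3450, tn=6500, fn=0)
```

With these made-up counts, zero false negatives give 100% sensitivity while 65% of records are auto-excluded, mirroring the shape (though not the actual numbers) of the results reported above.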

Conclusions: Refined GPT-4o prompts demonstrated high sensitivity and moderate specificity while reducing human workload. This approach shows potential for integrating LLMs into systematic review workflows to enhance efficiency.

Keywords: Data Interpretation, Statistical; Machine Learning; PSYCHIATRY.


Conflict of interest statement

Competing interests: AH, JT, JK, CF, PC, CM, AR, YK, KY, YY, ST, ĐB, EK, JP and GS: none. EGO received research and consultancy fees from Angelini Pharma. MH is a part-time employee of GET.ON Institut GmbH/HelloBetter, a company that implements digital therapeutics into routine care. SL has, in the last three years, received honoraria for advising/consulting and/or for lectures and/or for educational material from Angelini, Apsen, Boehringer Ingelheim, Eisai, Ekademia, Gedeon Richter, Janssen, Karuna, Kynexis, Lundbeck, Medichem, Medscape, Mitsubishi, Neurotorium, Otsuka, Novo Nordisk, Recordati, Rovi and Teva. TT is a part-time employee of Fitting Cloud, outside of the submitted work. RS is an employee of CureApp and reports grants from the Osake-no-Kagaku Foundation, the Mental Health Okamoto Memorial Foundation and the Kobayashi Magobe Memorial Medical Foundation, and personal fees from Otsuka Pharmaceutical, Nippon Shinyaku, Takeda Pharmaceutical and Sumitomo Pharma, outside this work; in addition, RS has patents JP2022049590A and US20220084673A1 pending, and patents JP2022178215A, JP2022070086 and JP2023074128A pending. YT reports grants from the Japan Society for the Promotion of Science, Kyoto University and the Pfizer Foundation, outside of the submitted work; in addition, YT is a board member of Cochrane Japan and works as a physician at Oku Medical Clinic. MS is employed in the Department of Neurodevelopmental Disorders, Nagoya City University Graduate School of Medicine, an endowment department supported by the City of Nagoya, and has received a personal fee from SONY outside of the submitted work. TAF reports personal fees from Boehringer-Ingelheim, Daiichi Sankyo, DT Axis, Micron, Shionogi, SONY and UpToDate, and grants from DT Axis and Shionogi, outside of the submitted work; in addition, TAF has patent 7448125 and pending patent 2022-082495, and has licensed intellectual properties for Kokoro-app to DT Axis.
AC has received research, educational and consultancy fees from the Italian Network for Paediatric Trials, CARIPLO Foundation, Lundbeck and Angelini Pharma, outside of the submitted work.

Figures

Figure 1. Creation of development and test sets of records in the baseline review.

