Utilizing large language models to select literature for meta-analysis shows workload reduction while maintaining a similar recall level as manual curation

Xiangming Cai et al. BMC Med Res Methodol. 2025 Apr 28;25(1):116. doi: 10.1186/s12874-025-02569-3
Abstract

Background: Large language models (LLMs) such as ChatGPT have shown great potential for aiding medical research. Screening records imposes a heavy workload in evidence-based medicine, especially in meta-analysis, yet few studies have attempted to use LLMs to assist with record screening for meta-analysis.

Objective: In this study, we aimed to explore whether multiple LLMs can be incorporated to facilitate title- and abstract-based screening of records during meta-analysis.

Methods: Several LLMs were evaluated, including GPT-3.5, GPT-4, Deepseek-R1-Distill, Qwen-2.5, Phi-4, Llama-3.1, Gemma-2, and Claude-2. To assess our strategy, we selected three meta-analyses from the literature, together with a glioma meta-analysis embedded in this study as additional validation. For the automatic selection of records from the curated meta-analyses, we developed a four-step strategy called LARS-GPT, consisting of (1) criteria selection and single-prompt (prompt with one criterion) creation, (2) best combination identification, (3) combined-prompt (prompt with one or more criteria) creation, and (4) request sending and answer summary. Recall, workload reduction, precision, and F1 score were calculated to assess the performance of LARS-GPT.
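To make the four-step strategy more concrete, below is a minimal Python sketch of how a single-prompt screening request and a combined decision could be implemented, assuming the OpenAI chat-completions API. The prompt wording, the model name, and the all-criteria-must-pass combination rule are illustrative assumptions, not the authors' exact implementation.

    # Minimal sketch of a LARS-GPT-style screening step (illustrative assumptions,
    # not the authors' exact code).
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def ask_single_prompt(title: str, abstract: str, criterion: str,
                          model: str = "gpt-3.5-turbo") -> bool:
        """Send one single-prompt (one criterion) for one record; return True to keep it."""
        prompt = (
            f"Inclusion criterion: {criterion}\n"
            f"Title: {title}\nAbstract: {abstract}\n"
            "Does this record satisfy the criterion? Answer YES or NO."
        )
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        answer = response.choices[0].message.content.strip().upper()
        return answer.startswith("YES")

    def screen_record(title: str, abstract: str, criteria: list[str]) -> bool:
        """Combine single-prompt answers for a chosen combination of criteria.
        Here a record is kept only if every criterion is satisfied (one possible rule)."""
        return all(ask_single_prompt(title, abstract, c) for c in criteria)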

Results: Performance varied across single-prompts, with a mean recall of 0.800. Based on these single-prompts, we identified combinations of criteria that performed better than the pre-set threshold. With the best combination of criteria, LARS-GPT achieved a 40.1% workload reduction on average while maintaining a recall greater than 0.9.
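For reference, the metrics reported above can be computed from the pipeline's keep/exclude decisions and the manually curated ground truth as sketched below. Treating workload reduction as the share of records the pipeline excludes from manual screening is an assumption based on common usage, not a definition taken verbatim from the paper.

    # Hedged sketch of the evaluation metrics used to assess LARS-GPT.
    def screening_metrics(predicted_keep: list[bool], truly_included: list[bool]) -> dict:
        # Confusion counts over all screened records
        tp = sum(p and t for p, t in zip(predicted_keep, truly_included))
        fp = sum(p and not t for p, t in zip(predicted_keep, truly_included))
        fn = sum((not p) and t for p, t in zip(predicted_keep, truly_included))
        n = len(predicted_keep)
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        # Assumed definition: fraction of records excluded from manual review
        workload_reduction = sum(not p for p in predicted_keep) / n if n else 0.0
        return {"recall": recall, "precision": precision,
                "f1": f1, "workload_reduction": workload_reduction}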

Conclusions: We show here the groundbreaking finding that automatic selection of literature for meta-analysis is possible with LLMs. We provide this approach as a pipeline, LARS-GPT, which achieved a substantial workload reduction while maintaining a pre-set recall.

Keywords: ChatGPT; Deepseek; Large language model; Meta-analysis; Phi.


Conflict of interest statement

Declarations. Ethics approval and consent to participate: Not applicable. Consent for publication: Not applicable. Competing interests: The authors declare no competing interests.

Figures

Fig. 1
Schematic illustration of the LARS-GPT pipeline. Single-prompt represents a prompt with only one criterion; combined-prompt represents a prompt with more than one criterion. Color of labels: single-prompt (blue), combined-prompt and prompt strategy (orange), and answer and decision (yellow)
Fig. 2
The research flow of this study. A representative case showing a request containing a single-prompt and the response from ChatGPT (A). Schematic illustration of the research flow (B), also showing the detailed input (prepared by human researchers) used to calculate ChatGPT performance metrics. Single-prompt represents a prompt with only one criterion; combined-prompt represents a prompt with more than one criterion. Color of labels: single-prompt (blue), combined-prompt and prompt strategy (orange), answer and decision (yellow), and true outcome of validation datasets (green)
Fig. 3
Comparison of the performance of the best and full combinations across the three prompt strategies, regarding precision (A), recall (B), F1 score (C), and workload reduction (D)
Fig. 4
Comparison of the performance of the best and full combinations across the 8 LLMs, regarding precision (A), recall (B), F1 score (C), and workload reduction (D). The lower panel shows the results of the corresponding non-parametric multiple comparisons with log10-transformed p values

