Utilizing large language models to select literature for meta-analysis shows workload reduction while maintaining a similar recall level as manual curation

Xiangming Cai et al. BMC Med Res Methodol. 2025 Apr 28;25(1):116. doi: 10.1186/s12874-025-02569-3
Abstract

Background: Large language models (LLMs) such as ChatGPT have shown great potential for aiding medical research. Screening records imposes a heavy workload in evidence-based medicine, especially in meta-analysis, yet few studies have attempted to use LLMs to assist with record screening for meta-analysis.

Objective: In this study, we aimed to explore whether multiple LLMs can be incorporated to facilitate title- and abstract-based screening of records during meta-analysis.

Methods: Several LLMs were evaluated, including GPT-3.5, GPT-4, Deepseek-R1-Distill, Qwen-2.5, Phi-4, Llama-3.1, Gemma-2, and Claude-2. To assess our strategy, we selected three meta-analyses from the literature, together with a glioma meta-analysis embedded in this study as additional validation. For the automatic selection of records from the curated meta-analyses, we developed a four-step strategy called LARS-GPT, consisting of (1) criteria selection and single-prompt (prompt with one criterion) creation, (2) best combination identification, (3) combined-prompt (prompt with one or more criteria) creation, and (4) request sending and answer summary. Recall, workload reduction, precision, and F1 score were calculated to assess the performance of LARS-GPT.
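To make the four-step strategy more concrete, below is a minimal Python sketch of how a single-prompt screening request and a combined decision could be implemented, assuming the OpenAI chat-completions API. The prompt wording, the model name, and the all-criteria-must-pass combination rule are illustrative assumptions, not the authors' exact implementation.

    # Minimal sketch of a LARS-GPT-style screening step (illustrative assumptions,
    # not the authors' exact code).
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def ask_single_prompt(title: str, abstract: str, criterion: str,
                          model: str = "gpt-3.5-turbo") -> bool:
        """Send one single-prompt (one criterion) for one record; return True to keep it."""
        prompt = (
            f"Inclusion criterion: {criterion}\n"
            f"Title: {title}\nAbstract: {abstract}\n"
            "Does this record satisfy the criterion? Answer YES or NO."
        )
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        answer = response.choices[0].message.content.strip().upper()
        return answer.startswith("YES")

    def screen_record(title: str, abstract: str, criteria: list[str]) -> bool:
        """Combine single-prompt answers for a chosen combination of criteria.
        Here a record is kept only if every criterion is satisfied (one possible rule)."""
        return all(ask_single_prompt(title, abstract, c) for c in criteria)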

Results: Performance varied across single-prompts, with a mean recall of 0.800. Based on these single-prompts, we identified combinations of criteria that performed better than the pre-set threshold. With the best combination of criteria, LARS-GPT achieved a 40.1% workload reduction on average while maintaining a recall greater than 0.9.
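For reference, the metrics reported above can be computed from the pipeline's keep/exclude decisions and the manually curated ground truth as sketched below. Treating workload reduction as the share of records the pipeline excludes from manual screening is an assumption based on common usage, not a definition taken verbatim from the paper.

    # Hedged sketch of the evaluation metrics used to assess LARS-GPT.
    def screening_metrics(predicted_keep: list[bool], truly_included: list[bool]) -> dict:
        # Confusion counts over all screened records
        tp = sum(p and t for p, t in zip(predicted_keep, truly_included))
        fp = sum(p and not t for p, t in zip(predicted_keep, truly_included))
        fn = sum((not p) and t for p, t in zip(predicted_keep, truly_included))
        n = len(predicted_keep)
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        # Assumed definition: fraction of records excluded from manual review
        workload_reduction = sum(not p for p in predicted_keep) / n if n else 0.0
        return {"recall": recall, "precision": precision,
                "f1": f1, "workload_reduction": workload_reduction}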

Conclusions: We show here the groundbreaking finding that automatic selection of literature for meta-analysis is possible with LLMs. We provide this approach as a pipeline, LARS-GPT, which achieved a substantial workload reduction while maintaining a pre-set recall.

Keywords: ChatGPT; Deepseek; Large language model; Meta-analysis; Phi.


Conflict of interest statement

Declarations. Ethics approval and consent to participate: Not applicable. Consent for publication: Not applicable. Competing interests: The authors declare no competing interests.

Figures

Fig. 1
Schematic illustration of the LARS-GPT pipeline. Single-prompt represents a prompt with only one criterion; combined-prompt represents a prompt with more than one criterion. Color of labels: single-prompt (blue), combined-prompt and prompt strategy (orange), and answer and decision (yellow)
Fig. 2
The research flow of this study. A representative case showing a request containing a single-prompt and the response from ChatGPT (A). Schematic illustration of the research flow (B), also showing the detailed input (prepared by human researchers) used to calculate ChatGPT performance metrics. Single-prompt represents a prompt with only one criterion; combined-prompt represents a prompt with more than one criterion. Color of labels: single-prompt (blue), combined-prompt and prompt strategy (orange), answer and decision (yellow), and true outcome of validation datasets (green)
Fig. 3
Comparison of the performance of the best and full combinations across the three prompt strategies, regarding precision (A), recall (B), F1 score (C), and workload reduction (D)
Fig. 4
Comparison of the performance of the best and full combinations across the 8 LLMs, regarding precision (A), recall (B), F1 score (C), and workload reduction (D). The lower panel shows the results of the corresponding non-parametric multiple comparisons with log10-transformed p values

