J Am Med Inform Assoc. 2025 Jun 1;32(6):1071-1086.

doi: 10.1093/jamia/ocaf063.

The emergence of large language models as tools in literature reviews: a large language model-assisted systematic review

Dmitry Scherbakov¹, Nina Hubig¹,², Vinita Jansari³, Alexander Bakumenko³, Leslie A Lenert¹

Affiliations

¹ Biomedical Informatics Center, Department of Public Health Sciences, Medical University of South Carolina (MUSC), Charleston, SC 29403, United States.
² Interdisciplinary Transformation University, OG 2 A-4040 Linz, Austria.
³ School of Computing, Clemson University, Charleston, SC 29634, United States.

PMID: 40332983
PMCID: PMC12089777
DOI: 10.1093/jamia/ocaf063

Dmitry Scherbakov et al. J Am Med Inform Assoc. 2025.

Abstract

Objectives: This study aims to summarize the usage of large language models (LLMs) in the process of creating a scientific review by looking at the methodological papers that describe the use of LLMs in review automation and the review papers that mention they were made with the support of LLMs.

Materials and methods: The search was conducted in June 2024 in PubMed, Scopus, Dimensions, and Google Scholar by human reviewers. The screening and extraction process took place in Covidence with the help of an LLM add-on based on the OpenAI GPT-4o model. ChatGPT and Scite.ai were used in cleaning the data, generating the code for figures, and drafting the manuscript.

Results: Of the 3788 articles retrieved, 172 studies were deemed eligible for the final review. ChatGPT and GPT-based LLMs emerged as the most dominant architecture for review automation (n = 126, 73.2%). A significant number of review automation projects were found, but only a limited number of papers (n = 26, 15.1%) were actual reviews that acknowledged LLM usage. Most citations focused on the automation of a particular stage of review, such as Searching for publications (n = 60, 34.9%) and Data extraction (n = 54, 31.4%). When comparing the pooled performance of GPT-based and BERT-based models, the former performed better in data extraction, with a mean precision of 83.0% (SD = 10.4) and a recall of 86.0% (SD = 9.8).
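For readers less familiar with the metrics reported above, the following is a minimal Python sketch of how precision, recall, and the share of included studies are typically computed. It is illustrative only and is not the authors' analysis code; the function names `precision_recall` and `share` and the example counts passed to `precision_recall` are hypothetical, while the study counts passed to `share` are taken from the Results.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) from true-positive, false-positive,
    and false-negative counts. Returns 0.0 where a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def share(n: int, total: int) -> float:
    """Percentage of `total` represented by `n`, rounded to one decimal place."""
    return round(100 * n / total, 1)


if __name__ == "__main__":
    # Hypothetical counts for a single extraction task (not from the paper).
    p, r = precision_recall(tp=43, fp=7, fn=7)
    print(f"precision={p:.2f} recall={r:.2f}")

    # Shares of the 172 included studies, using counts reported in the Results.
    total_included = 172
    for label, n in [("reviews acknowledging LLM use", 26),
                     ("searching for publications", 60),
                     ("data extraction", 54)]:
        print(f"{label}: {share(n, total_included)}%")
```

Note that a mean precision and a mean recall pooled across studies, as reported here, cannot be combined into a pooled F1 score without the per-study values.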

Discussion and conclusion: Our LLM-assisted systematic review revealed a significant number of research projects related to review automation using LLMs. Despite limitations, such as lower accuracy of extraction for numeric data, we anticipate that LLMs will soon change the way scientific reviews are conducted.

Keywords: Covidence; large language models; review automation; scoping review; systematic review.

Conflict of interest statement

The authors have no competing interests to declare.

Figures

**Figure 1.**
LLM workflow added into Covidence for screening and extraction.

**Figure 2.**
Flow diagram of the systematic review process.

**Figure 3.**
(A) Publications by country of origin. (B) Publications by state in the United States.

**Figure 4.**
(A) Types of automated review. (B) Which stages of review are automated in the paper.

**Figure 5.**
LLM types proposed for automation (models mentioned in 2 or more studies shown).

**Figure 6.**
Performance metrics reported for the 3 most common automated stages for (A) GPT-based models and (B) BERT-based models. Boxplots display median value, while comparison is provided using means.

Grants and funding

UL1 TR001450/TR/NCATS NIH HHS/United States
