Syst Rev. 2024 Jun 15;13(1):158. doi: 10.1186/s13643-024-02575-4.

Title and abstract screening for literature reviews using large language models: an exploratory study in the biomedical domain


Fabio Dennstädt et al.

Abstract

Background: Systematically screening published literature to determine the relevant publications to synthesize in a review is a time-consuming and difficult task. Large language models (LLMs) are an emerging technology with promising capabilities for the automation of language-related tasks that may be useful for such a purpose.

Methods: LLMs were used as part of an automated system to evaluate the relevance of publications to a given topic, based on defined criteria and on the title and abstract of each publication. A Python script was created to generate structured prompts consisting of text strings for the instruction, title, abstract, and relevant criteria, which were provided to an LLM. The LLM rated the relevance of each publication on a Likert scale (low relevance to high relevance). By specifying a threshold, different classifiers for inclusion/exclusion of publications could then be defined. The approach was tested with four different openly available LLMs on ten published data sets of biomedical literature reviews and on a newly created, human-annotated data set for a hypothetical new systematic literature review.
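
Not the authors' script, but a minimal Python sketch of the pipeline described above; the prompt wording, the criteria string, and the default cutoff are illustrative assumptions rather than the paper's exact implementation:

    def build_prompt(title: str, abstract: str, criteria: str) -> str:
        # Assemble the structured prompt from instruction, title, abstract,
        # and criteria (wording here is an assumption, not the authors').
        return (
            "Rate the relevance of the following publication to the review topic "
            "on a Likert scale from 1 (low relevance) to 5 (high relevance), "
            "according to these criteria: " + criteria + "\n\n"
            "Title: " + title + "\n"
            "Abstract: " + abstract + "\n"
            "Relevance score:"
        )

    def include(score: int, threshold: int = 3) -> bool:
        # A threshold classifier over the LLM's score: with threshold=3 on a
        # 1-5 scale, this corresponds to the "3+" classifier in the figures.
        return score >= threshold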

Results: The performance of the classifiers varied depending on the LLM used and on the data set analyzed. On the ten published data sets, the classifiers yielded sensitivity/specificity of 94.48%/31.78% for the FlanT5 model, 97.58%/19.12% for the OpenHermes-NeuralChat model, 81.93%/75.19% for the Mixtral model, and 97.58%/38.34% for the Platypus 2 model. On the newly created data set, the same classifiers yielded 100% sensitivity at specificities of 12.58%, 4.54%, 62.47%, and 24.74%, respectively. Changing the standard settings of the approach (minor adaptation of the instruction prompt and/or changing the range of the Likert scale from 1-5 to 1-10) had a considerable impact on performance.
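
As a reading aid for these metrics, sensitivity and specificity can be computed from gold inclusion labels and classifier decisions as in this short sketch (variable names are assumptions):

    def sensitivity_specificity(labels, predictions):
        # labels: True if a publication was actually included in the review;
        # predictions: True if the threshold classifier includes it.
        tp = sum(y and p for y, p in zip(labels, predictions))
        fn = sum(y and not p for y, p in zip(labels, predictions))
        tn = sum(not y and not p for y, p in zip(labels, predictions))
        fp = sum(not y and p for y, p in zip(labels, predictions))
        return tp / (tp + fn), tn / (tn + fp)

Sweeping the inclusion threshold over the full score range yields operating points like those traced by the ROC curves in Fig. 4.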

Conclusions: LLMs can be used to evaluate the relevance of scientific publications to a certain review topic and classifiers based on such an approach show some promising results. To date, little is known about how well such systems would perform if used prospectively when conducting systematic literature reviews and what further implications this might have. However, it is likely that in the future researchers will increasingly use LLMs for evaluating and classifying scientific publications.

Keywords: Biomedicine; Large language models; Natural language processing; Systematic literature review; Title and abstract screening.


Conflict of interest statement

NC is a technical lead for the SmartOncology© project and a medical advisor for Wemedoo AG, Steinhausen, Switzerland. The authors declare that they have no other competing interests.

Figures

Fig. 1: Schematic illustration of the LLM-based approach for evaluating the relevance of a scientific publication. In this example, a 1–5 scale and a 3+ classifier are used.

Fig. 2: Distribution of scores given by the different models.

Fig. 3: Sensitivity and specificity of the 3+ classifiers on different data sets using different models. Each data point represents the results of one data set.

Fig. 4: Receiver operating characteristic (ROC) curves of the LLM-based title and abstract screening for the different models on the CDSS_RO data set.

Fig. 5: Performance of the classifiers depending on the adaptation of the prompt and on the range of the scale.
