Validation of automated paper screening for esophagectomy systematic review using large language models

Rashi Ramchandani¹,², Eddie Guo³, Esra Rakab¹, Jharna Rathod¹, Jamie Strain⁴, William Klement⁴,⁵, Risa Shorr⁶, Erin Williams⁷,⁸, Daniel Jones⁷,⁸, Sebastien Gilbert⁷,⁸

Affiliations

¹ Department of Medicine, University of Ottawa, Ottawa, Ontario, Canada.
² Institute of Health Policy, Management and Evaluation, University of Toronto, Toronto, Ontario, Canada.
³ Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada.
⁴ Ottawa Hospital Research Institute, Ottawa, Ontario, Canada.
⁵ Faculty of Computer Sciences, Dalhousie University, Dalhousie, Halifax, Canada.
⁶ Library and Learning Services, The Ottawa Hospital, Ottawa, Ontario, Canada.
⁷ Division of General Surgery, Department of Surgery, The Ottawa Hospital, Ottawa, Ontario, Canada.
⁸ Division of Thoracic Surgery, Department of Surgery, The Ottawa Hospital, Ottawa, Ontario, Canada.

PMID: 40567772
PMCID: PMC12190591
DOI: 10.7717/peerj-cs.2822

Rashi Ramchandani et al. PeerJ Comput Sci. 2025 Apr 30;11:e2822. doi: 10.7717/peerj-cs.2822. eCollection 2025.

Abstract

Background: Large language models (LLMs) offer a potential solution to the labor-intensive nature of systematic reviews. This study evaluated the ability of the GPT model to identify articles that discuss perioperative risk factors for esophagectomy complications. To test the model further, we applied a narrower inclusion criterion and assessed GPT-4's ability to identify relevant articles that reported only preoperative risk factors for esophagectomy.

Methods: A literature search was run by a trained librarian to identify studies (n = 1,967) discussing risk factors for esophagectomy complications. The articles underwent title and abstract screening by three independent human reviewers and by GPT-4. The Python script used for the analysis made Application Programming Interface (API) calls to GPT-4, with the screening criteria written in natural language. GPT-4's inclusion and exclusion decisions were compared with those of the human reviewers.
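The screening step described above can be illustrated with a minimal sketch. It is not the authors' script: the criteria text, the prompt wording, the reply format, and the names `build_prompt`, `parse_decision`, `screen`, and `ask_model` are all assumptions made for illustration. The model call is passed in as a plain function so that the sketch runs without any particular API client.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical criteria text; the study's actual prompt is not given in the abstract.
CRITERIA = (
    "Include studies that report perioperative risk factors for "
    "complications after esophagectomy."
)


def build_prompt(title: str, abstract: str, criteria: str = CRITERIA) -> str:
    """Assemble a natural-language screening prompt for one record."""
    return (
        f"Screening criteria: {criteria}\n\n"
        f"Title: {title}\n"
        f"Abstract: {abstract}\n\n"
        "Answer with INCLUDE or EXCLUDE on the first line, "
        "then give a one-sentence justification."
    )


def parse_decision(reply: str) -> Tuple[bool, str]:
    """Split a model reply into (include?, justification)."""
    first, _, rest = reply.strip().partition("\n")
    label = first.strip().upper()
    if label.startswith("INCLUDE"):
        return True, rest.strip()
    if label.startswith("EXCLUDE"):
        return False, rest.strip()
    # Fail loudly rather than guess when the reply is off-format.
    raise ValueError(f"Unrecognized decision: {first!r}")


def screen(records: List[Dict], ask_model: Callable[[str], str]) -> List[Dict]:
    """Screen records with any callable that maps a prompt to a reply."""
    results = []
    for rec in records:
        reply = ask_model(build_prompt(rec["title"], rec["abstract"]))
        include, why = parse_decision(reply)
        results.append({"id": rec["id"], "include": include, "justification": why})
    return results
```

In practice `ask_model` would wrap whichever chat-completion client is in use. Keeping the justification alongside the decision mirrors the abstract's note that reviewers found the model's stated reasons useful when resolving discrepancies.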

Results: The agreement between the GPT model and human decisions was 85.58% for perioperative factors and 78.75% for preoperative factors. The AUC values were 0.87 and 0.75 for the perioperative and preoperative risk factor queries, respectively. In the evaluation of perioperative risk factors, the GPT model demonstrated a high recall for included studies at 89%, a positive predictive value of 74%, and a negative predictive value of 84%, with a low false positive rate of 6% and a macro-F1 score of 0.81. For preoperative risk factors, the model showed a recall of 67% for included studies, a positive predictive value of 65%, and a negative predictive value of 85%, with a false positive rate of 15% and a macro-F1 score of 0.66. The interobserver reliability was substantial, with a kappa score of 0.69 for perioperative factors and 0.61 for preoperative factors. Despite lower accuracy under more stringent criteria, the GPT model proved valuable in streamlining the systematic review workflow. In a preliminary evaluation, study screeners reported that the inclusion and exclusion justifications provided by the GPT model were useful, especially in resolving discrepancies during title and abstract screening.
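For readers less familiar with these measures, the following sketch shows how each can be derived from a 2×2 table of model decisions against human decisions, using the standard definitions. The function name `screening_metrics` is hypothetical and the counts in the usage note are invented for illustration; they are not the study's data.

```python
def screening_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard agreement metrics, treating the human decision as reference.

    tp: both include, fp: model includes but humans exclude,
    fn: model excludes but humans include, tn: both exclude.
    """
    n = tp + fp + fn + tn
    recall = tp / (tp + fn)        # sensitivity for included studies
    ppv = tp / (tp + fp)           # positive predictive value
    npv = tn / (tn + fn)           # negative predictive value
    specificity = tn / (tn + fp)
    fpr = fp / (fp + tn)           # false positive rate = 1 - specificity

    # Macro-F1: unweighted mean of the F1 scores of the two classes.
    f1_include = 2 * ppv * recall / (ppv + recall)
    f1_exclude = 2 * npv * specificity / (npv + specificity)
    macro_f1 = (f1_include + f1_exclude) / 2

    # Cohen's kappa: observed agreement corrected for chance agreement.
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    kappa = (observed - expected) / (1 - expected)

    return {
        "agreement": observed, "recall": recall, "ppv": ppv, "npv": npv,
        "fpr": fpr, "macro_f1": macro_f1, "kappa": kappa,
    }
```

With invented counts of `tp=40, fp=10, fn=5, tn=45`, the raw agreement is 0.85 while kappa is 0.70, which shows why kappa is reported alongside percent agreement: it discounts the agreement expected by chance alone.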

Conclusion: This study demonstrates a promising use of LLMs to streamline the workflow of systematic reviews. The integration of LLMs in systematic reviews could lead to significant time and cost savings; however, caution is warranted for reviews with narrower, more stringent inclusion and exclusion criteria. Future research should explore integrating LLMs in other steps of the systematic review, such as full-text screening or data extraction, and compare different LLMs for their effectiveness in various types of systematic reviews.

Keywords: Abstract screening; ChatGPT; Large language model; Screening; Systematic review.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Figure 1. Visual depiction of study methodology from literature search to analysis.**
The figure depicts the workflow and methodology of the study, describing the two runs of the Python script with the broad (perioperative risk factors) and narrow (preoperative risk factors) screening criteria.

**Figure 2. AUC curves for (A) peri-operative and (B) pre-operative risk factors.**
The AUC was 0.87 for the perioperative run and 0.75 for the preoperative run. Both values are well above 0.5, the level expected from random chance, which indicates that the GPT model is able to differentiate between relevant and non-relevant articles.
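As a reminder of what the AUC measures, the sketch below computes it as the probability that a randomly chosen relevant article receives a higher score than a randomly chosen non-relevant one, with ties counted as one half. This is the general rank-based definition, not necessarily how the authors produced their curves; the function name `roc_auc` and the assumption that a numeric score is available for each article are illustrative.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Rank-based AUC: P(score of a relevant item > score of a non-relevant item).

    labels: 1 for relevant (human include), 0 for non-relevant.
    scores: any numeric score where higher means "more likely relevant".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one item of each class")

    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))
```

A scorer that assigns every article the same score returns exactly 0.5 under this definition, which is the random-chance reference level mentioned in the caption.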
