Syst Rev. 2023 Jan 3;12(1):1.
doi: 10.1186/s13643-022-02163-4.

Unsupervised title and abstract screening for systematic review: a retrospective case-study using topic modelling methodology


Agnes Natukunda et al. Syst Rev.

Abstract

Background: The importance of systematic reviews in collating and summarising the available research output on a particular topic cannot be over-emphasized. However, the initial screening of retrieved literature is highly time- and labour-intensive. Attempts at automating parts of the systematic review process have met with varying degrees of success, partly because existing approaches are domain-specific, require vendor-specific software, or rely on manually labelled training data. Our primary objective was to develop statistical methodology for automated title and abstract screening for systematic reviews. Secondary objectives were (1) to retrospectively apply the automated screening methodology to previously manually screened systematic reviews and (2) to characterize the performance of the automated screening scoring algorithm in a simulation study.

Methods: We implemented a Latent Dirichlet Allocation (LDA)-based topic model to derive representative topics from the titles and abstracts of the retrieved documents. The second step defines a score threshold for classifying documents as relevant for full-text review or not; the score is derived from a set of search keywords (often the database retrieval search terms). Two systematic review studies were used retrospectively to illustrate the methodology.
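The two steps above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the corpus, keywords, and the 50th-percentile relevance threshold are invented for the example, and scikit-learn stands in for whatever LDA implementation the authors used.

```python
# Sketch (assumed, not the authors' implementation): fit LDA on titles/abstracts,
# then score each document by how much topic mass its topics give to the search
# keywords, and classify against a percentile threshold.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [  # hypothetical mini-corpus
    "praziquantel treatment of schistosomiasis in school children",
    "randomised trial of albendazole for soil transmitted helminths",
    "economic impact of crop rotation on maize yield",
    "deworming programmes and anaemia outcomes in pregnant women",
]
keywords = ["helminths", "deworming", "praziquantel", "albendazole"]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# beta: word-topic probabilities; gamma: document-topic probabilities
beta = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
gamma = lda.transform(X)

vocab = vec.vocabulary_
kw_idx = [vocab[k] for k in keywords if k in vocab]
topic_scores = beta[:, kw_idx].sum(axis=1)   # keyword probability mass per topic
doc_scores = gamma @ topic_scores            # relevance score per document

threshold = np.percentile(doc_scores, 50)    # hypothetical percentile threshold
relevant = doc_scores >= threshold           # documents flagged for full-text review
print(relevant)
```

In a real screening run the keyword list would be the database retrieval search terms, and the percentile would be tuned as in the paper's simulation study.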

Results: For one case study (the helminth dataset), a sensitivity of [Formula: see text] compared to manual title and abstract screening was achieved, against a false positive rate of [Formula: see text]. For the second case study (the Wilson disease dataset), a sensitivity of [Formula: see text] and a specificity of [Formula: see text] were achieved.

Conclusions: Unsupervised title and abstract screening has the potential to reduce the workload involved in conducting systematic reviews. While the sensitivity of the methodology on the tested data is low, approximately [Formula: see text] specificity was achieved. Users ought to keep in mind that potentially low sensitivity might occur. One approach to mitigating this is to incorporate additional targeted search keywords, such as the indexing databases' terms, into the search term corpora. Moreover, automated screening can be used as an additional screener alongside the manual screeners.

Keywords: Abstract screening; Automated systematic review; Latent Dirichlet Allocation; Topic modelling; Unsupervised learning.


Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
The helminths data: grid search of the number of LDA topics. The Arun2010 and CaoJuan2009 metrics are based on minimization, while Griffiths2004 is based on maximization of the corresponding algorithm. Y-axis: normalized measure of performance
Fig. 2
The helminths data: Word-topic probability matrix β. Top 5 words (in their root form) for each of the 20 topics. X-axis: posterior word-topic probability
Fig. 3
The helminths data: document-topic probability matrix γ for the full corpus. X-axis: posterior document-topic probability. Y-axis: individual documents in the corpus. The panels denote documents considered relevant (or not) for full-text screening based on manual evaluation
Fig. 4
The helminths data: search keywords posterior word-topic probability β. Fill colour gradient: posterior word-topic probability. X-axis: individual search keywords in the corpus. Y-axis: Topics. NA: search keyword did not occur in documents
Fig. 5
The helminths data: search keywords sum of word-topic probability β. X-axis: posterior word-topic probability. Y-axis: individual search keywords in the corpus. Colour: corresponding percentile (of the sum of search term scores) the topics cover
Fig. 6
Simulation study: average sensitivity results. X-axis: number of LDA topics. TPR, true positive rate. Rows: proportion of relevant documents included in the simulation dataset. Columns: percentile used to compute the relevance threshold. Solid circles: average TPR. Error bars: 95% confidence interval. Horizontal dashed line: 50% TPR
Fig. 7
Simulation study: average false positive rate results. X-axis: number of LDA topics. FPR, false positive rate. Rows: proportion of relevant documents included in the simulation dataset. Columns: percentile used to compute the relevance threshold. Solid circles: average FPR. Error bars: 95% confidence interval. Horizontal dashed line: 50% FPR
Fig. 8
The Wilson data: grid search of the number of LDA topics. The Arun2010 and CaoJuan2009 metrics are based on minimization, while Griffiths2004 is based on maximization of the corresponding algorithm. Y-axis: normalized measure of performance
Fig. 9
The Wilson data: search keywords sum of word-topic probability β. X-axis: posterior word-topic probability. Y-axis: individual search keywords in the corpus
Fig. 10
The Wilson data: search keywords sum of word-topic probability β. X-axis: posterior word-topic probability. Y-axis: individual search keywords in the corpus. Colour: corresponding percentile (of the sum of search term scores) the topics cover
Fig. 11
Example scree plot: Determining an optimal number of topics to model. X-axis: Number of topics. Y-axis: Normalized metric score. Colour bands: interpretation of the metric scores for the corresponding number of topics
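The topic-number grid search behind the scree plots (Figs. 1, 8, and 11) can be sketched as below. This is an illustrative substitution, not the authors' code: held-out-style perplexity stands in for the Griffiths2004, CaoJuan2009, and Arun2010 metrics named in the captions, and the corpus and candidate topic counts are invented.

```python
# Sketch (assumed): grid search over the number of LDA topics, scoring each
# fit with perplexity (lower is better) as a stand-in tuning metric.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [  # hypothetical mini-corpus
    "topic models summarise large document collections",
    "latent dirichlet allocation infers topics from word counts",
    "screening titles and abstracts is labour intensive",
    "systematic reviews collate research on a topic",
    "automated screening reduces reviewer workload",
    "search keywords guide retrieval of relevant studies",
]
X = CountVectorizer(stop_words="english").fit_transform(docs)

scores = {}
for k in (2, 3, 4, 5):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)
    scores[k] = lda.perplexity(X)  # lower perplexity = better fit

best_k = min(scores, key=scores.get)
print(best_k, scores)
```

Plotting the (normalized) scores against the number of topics yields a scree plot of the kind shown in Fig. 11, from which an elbow or optimum is read off.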
