Syst Rev. 2023 Jun 5;12(1):94.
doi: 10.1186/s13643-023-02247-9.

Ensemble of deep learning language models to support the creation of living systematic reviews for the COVID-19 literature

Julien Knafou et al. Syst Rev. 2023.

Abstract

Background: The COVID-19 pandemic has led to an unprecedented number of scientific publications, growing at a pace never seen before. Multiple living systematic reviews have been developed to provide professionals with up-to-date and trustworthy health information, but it is increasingly challenging for systematic reviewers to keep up with the evidence in electronic databases. We aimed to investigate deep learning-based algorithms for classifying COVID-19-related publications to help scale up the epidemiological curation process.

Methods: In this retrospective study, five different pre-trained deep learning-based language models were fine-tuned on a dataset of 6365 publications manually classified into two classes, three subclasses, and 22 sub-subclasses relevant for epidemiological triage purposes. In a k-fold cross-validation setting, each standalone model was assessed on a classification task and compared against an ensemble, which takes the standalone model predictions as input and uses different strategies to infer the optimal article class. A ranking task was also considered, in which the model outputs a ranked list of sub-subclasses associated with the article.
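The two ensemble strategies named in the figure captions, voting and probability sum, can be illustrated with a minimal sketch. It assumes each standalone model returns one probability per class for an article; the function names and the tie-breaking rule are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter


def probability_sum_scores(per_model_probs):
    """Sum the class probabilities across models (one list of probabilities per model)."""
    n_classes = len(per_model_probs[0])
    return [sum(probs[c] for probs in per_model_probs) for c in range(n_classes)]


def probability_sum(per_model_probs):
    """Predict the class with the largest summed probability."""
    scores = probability_sum_scores(per_model_probs)
    return max(range(len(scores)), key=scores.__getitem__)


def majority_vote(per_model_probs):
    """Each model votes for its top class; the most-voted class wins.

    Assumption: ties in vote count are broken by the summed probability.
    """
    votes = Counter(
        max(range(len(probs)), key=probs.__getitem__) for probs in per_model_probs
    )
    summed = probability_sum_scores(per_model_probs)
    return max(votes, key=lambda c: (votes[c], summed[c]))


# Hypothetical outputs of three models for one article and two classes.
example = [[0.60, 0.40], [0.55, 0.45], [0.10, 0.90]]
vote_prediction = majority_vote(example)    # two of three models prefer class 0
sum_prediction = probability_sum(example)   # class 1 has the larger summed probability
```

The example is chosen so that the two strategies disagree: voting counts heads, whereas the probability sum lets one very confident model outweigh two weakly confident ones.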

Results: The ensemble model significantly outperformed the standalone classifiers, achieving an F1-score of 89.2 at the class level of the classification task. The difference between the standalone and ensemble models increased at the sub-subclass level, where the ensemble reached a micro F1-score of 70% against 67% for the best-performing standalone model. For the ranking task, the ensemble obtained the highest recall@3, at 89%. Using a unanimity voting rule, the ensemble can provide predictions with higher confidence on a subset of the data, detecting original papers with an F1-score of up to 97% on a subset of 80% of the collection, compared with 93% on the whole dataset.
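For readers unfamiliar with recall@3, the following sketch shows one common way to compute recall@k for a ranking task. It assumes each article comes with a ranked list of predicted sub-subclasses and a set of gold labels, and it pools hits over all gold labels; the abstract does not specify the exact averaging used in the study, and the label names below are made up.

```python
def recall_at_k(ranked_labels, gold_labels, k=3):
    """Fraction of gold labels that appear among the top-k ranked predictions.

    ranked_labels: one ranked list of labels per article (best first).
    gold_labels:   one set of correct labels per article.
    """
    hits = 0
    total = 0
    for ranked, gold in zip(ranked_labels, gold_labels):
        top_k = set(ranked[:k])
        hits += len(top_k & set(gold))
        total += len(gold)
    return hits / total if total else 0.0


# Hypothetical rankings for two articles over four made-up labels.
ranked = [["a", "b", "c", "d"], ["d", "c", "b", "a"]]
gold = [{"c"}, {"a"}]
score_at_3 = recall_at_k(ranked, gold, k=3)  # only the first article is a hit
score_at_4 = recall_at_k(ranked, gold, k=4)  # both gold labels are retrieved
```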

Conclusion: This study shows the potential of using deep learning language models to perform triage of COVID-19 references efficiently and support epidemiological curation and review. The ensemble consistently and significantly outperforms any standalone model. Tuning the thresholds of the voting strategy is an interesting alternative for annotating a subset with higher predictive confidence.
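The trade-off between confidence and coverage mentioned in the conclusion can be sketched as a unanimity rule with a per-vote probability threshold. This is an illustrative interpretation only: it assumes a model casts a vote when its top probability reaches the threshold, and that the ensemble abstains (returns `None`) unless every model votes for the same class. The function names and the abstention behaviour are assumptions.

```python
def unanimous_vote(per_model_probs, threshold=0.5):
    """Return the agreed class index, or None when the ensemble abstains.

    The ensemble abstains if any model's top probability is below the
    threshold or if the models do not all pick the same class.
    """
    votes = []
    for probs in per_model_probs:
        top = max(range(len(probs)), key=probs.__getitem__)
        if probs[top] < threshold:
            return None
        votes.append(top)
    return votes[0] if len(set(votes)) == 1 else None


def coverage(predictions):
    """Share of articles for which the ensemble produced a prediction."""
    return sum(p is not None for p in predictions) / len(predictions)


# Hypothetical outputs of two models for three articles.
articles = [
    [[0.9, 0.1], [0.8, 0.2]],  # confident and unanimous
    [[0.9, 0.1], [0.6, 0.4]],  # second model falls below the threshold
    [[0.9, 0.1], [0.2, 0.8]],  # models disagree
]
predictions = [unanimous_vote(a, threshold=0.7) for a in articles]
```

Raising the threshold makes the ensemble abstain more often, so metrics are then computed on a smaller subset of the collection; articles without a prediction would be left for manual screening.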

Keywords: COVID-19; Deep learning; Language model; Literature screening; Living systematic review; Text classification; Transfer learning.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Overview of the study design. All articles were manually annotated, and the title, abstract, and source of each were then retrieved. In a k-fold cross-validation setting (k = 5 in our experiments), five models were fine-tuned, and the standalone models were compared against one another as well as against two types of ensemble
Fig. 2
Publication classifier workflow. The pipeline starts with the title, abstract, and source fields and concatenates their text contents before tokenization. Each model computes its predictions, and an ensemble strategy, either voting or probability sum, combines them into a final prediction
Fig. 3
A Precision/recall curves of the ORIGINAL class for the RoBERTa base/large and the ensemble. B Precision/recall curves obtained by the ensemble model for the sub-subclasses. Well-represented sub-subclasses usually perform better than underrepresented ones
Fig. 4
Confusion matrices for class (A), subclass (B), and sub-subclass (C). The ensemble is more likely to confuse sub-subclasses that share the same parent subclass or class, which is why performance tends to be higher at the higher levels
Fig. 5
F1-score (A), precision (B), and recall (C) for the ORIGINAL class as a function of the per-vote probability threshold when applying the voting strategy to the class-level predictions. Using different thresholds considerably improves performance while reducing the number of predicted publications
Fig. 6
A, B, and C Top 20 positive-impact words for the EPI (A), BASIC (B), and OTHER (C) subclasses, computed with integrated gradients on a previously unseen set of about 600 documents. D, E, and F Classification examples, with a focus on passages that carry impact word scores
