EventEpi-A natural language processing framework for event-based surveillance

Auss Abbood¹, Alexander Ullrich¹, Rüdiger Busche²,³, Stéphane Ghozzi¹,⁴

Affiliations

¹ Robert Koch Institute (RKI), Berlin, Germany.
² Osnabrück University, Osnabrück, Lower Saxony, Germany.
³ inserve GmbH, Hannover, Lower Saxony, Germany.
⁴ Helmholtz Centre for Infection Research (HZI), Brunswick, Lower Saxony, Germany.

PMID: 33216746
PMCID: PMC7717563
DOI: 10.1371/journal.pcbi.1008277

Auss Abbood et al. PLoS Comput Biol. 2020 Nov 20;16(11):e1008277. doi: 10.1371/journal.pcbi.1008277. eCollection 2020 Nov.

Abstract

According to the World Health Organization (WHO), around 60% of all outbreaks are detected using informal sources. In many public health institutes, including the WHO and the Robert Koch Institute (RKI), dedicated groups of public health agents sift through numerous articles and newsletters to detect relevant events. This media screening is one important part of event-based surveillance (EBS). Reading the articles, discussing their relevance, and putting key information into a database is a time-consuming process. To support EBS, but also to gain insights into what makes an article and the event it describes relevant, we developed a natural language processing framework for automated information extraction and relevance scoring. First, we scraped relevant sources for EBS as done at the RKI (WHO Disease Outbreak News and ProMED) and automatically extracted the articles' key data: disease, country, date, and confirmed-case count. For this, we performed named entity recognition in two steps. In the first step, EpiTator, an open-source epidemiological annotation tool, suggested many candidate entities for each of these key data. In the second step, we selected the key entity among the candidates: we extracted the key country and disease using a heuristic, with good results, and we trained a naive Bayes classifier to find the key date and confirmed-case count, using the RKI's EBS database as labels; this classifier performed modestly. Then, for relevance scoring, we defined two classes to which any article might belong: the article is relevant if it is in the EBS database and irrelevant otherwise. We compared the performance of different classifiers, using bag-of-words, document embeddings, and word embeddings. The best classifier, a logistic regression, achieved a sensitivity of 0.82 and an index balanced accuracy of 0.61. Finally, we integrated these functionalities into a web application called EventEpi, where relevant sources are automatically analyzed and put into a database. The user can also provide any URL or text, which will be analyzed in the same way and added to the database.
Each of these steps could be improved, in particular with larger labeled datasets and fine-tuning of the learning algorithms. The overall framework, however, already works well and can be used in production, promising improvements in EBS. The source code and data are publicly available under open licenses.
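
The two reported metrics, sensitivity and index balanced accuracy (IBA), can be illustrated with a small sketch. It assumes the common definition of IBA as the product of sensitivity and specificity, weighted by a dominance term `(1 + alpha * (sensitivity - specificity))`. The abstract does not state the weighting parameter, so `alpha = 0.1` is an assumed value, and the labels below are toy data, not the RKI's EBS database.

```python
def confusion_counts(y_true, y_pred, positive="relevant"):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def sensitivity_specificity(tp, fp, tn, fn):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


def index_balanced_accuracy(sensitivity, specificity, alpha=0.1):
    """IBA under the assumed definition; alpha = 0.1 is an assumption."""
    dominance = sensitivity - specificity
    return (1 + alpha * dominance) * sensitivity * specificity


# Toy labels (hypothetical): 4 relevant and 6 irrelevant articles.
y_true = ["relevant"] * 4 + ["irrelevant"] * 6
y_pred = (["relevant"] * 3 + ["irrelevant"]
          + ["relevant"] * 2 + ["irrelevant"] * 4)

sens, spec = sensitivity_specificity(*confusion_counts(y_true, y_pred))
print(sens, spec, index_balanced_accuracy(sens, spec))
```

Because relevant articles are much rarer than irrelevant ones in media screening, a metric that combines both class-wise rates is more informative here than plain accuracy, which a classifier could inflate by always predicting "irrelevant".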

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Fig 1. An illustration of the *EventEpi* architecture.**
The orange part of the figure describes the relevance scoring of epidemiological texts, which are vectorized with word embeddings (created with word2vec), document embeddings (mean over word embeddings), and bag-of-words, and then fed to different classification algorithms (support vector machine (SVM), k-nearest neighbor (kNN), and logistic regression (LR), among others). The part of *EventEpi* that extracts the key information is colored in blue. Key information extraction is trained on sentences containing named entities using a naive Bayes classifier or the most-frequent approach applied to the output of EpiTator, an epidemiological annotation software. The workflow ends with the results being saved into the *EventEpi* database, which is embedded into *EventEpi*’s web application.
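
Two of the simpler building blocks named in this caption can be sketched in a few lines: a document embedding as the mean over word embeddings, and a most-frequent approach for choosing one key entity among several candidates. This is a minimal illustration, not the authors' implementation. The function names, the toy vectors, and the candidate list are hypothetical, and skipping tokens without an embedding is an assumption about how out-of-vocabulary words might be handled.

```python
from collections import Counter


def mean_document_embedding(tokens, word_vectors):
    """Average the vectors of all tokens that have an embedding.

    Tokens missing from word_vectors are skipped (an assumption).
    Returns None if no token has an embedding.
    """
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def most_frequent_entity(candidates):
    """Return the candidate that occurs most often in the list.

    Ties go to the candidate seen first. Returns None for an empty list.
    """
    if not candidates:
        return None
    return Counter(candidates).most_common(1)[0][0]


# Toy two-dimensional word vectors (hypothetical, not word2vec output).
word_vectors = {"cholera": [1.0, 0.0], "outbreak": [0.0, 1.0]}
doc_vector = mean_document_embedding(
    ["cholera", "outbreak", "reported"], word_vectors
)

# Hypothetical country candidates, standing in for an annotator's output.
key_country = most_frequent_entity(["Yemen", "Sudan", "Yemen"])
print(doc_vector, key_country)
```

Averaging yields a fixed-length vector regardless of article length, which is what classifiers such as logistic regression or an SVM require as input.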

**Fig 2. A layer-wise relevance propagation of the CNN for relevance classification.**
This text was correctly classified as relevant. Words highlighted in red contributed to the classification of the article as *relevant*, and words highlighted in blue contradicted this classification. The saturation of the color indicates how strongly each word contributed to the classification. <UNK> indicates a token for which no word embedding is available.

**Fig 3. A screenshot of the *EventEpi* web application.**
The top input text field receives a URL. The article at this URL is summarized when the SUMMARIZE button is pushed. The result of this summary is entered into the datatable, which is displayed as a table. The buttons Get WHO DONs and Get Promed Articles automatically scrape the latest articles from both platforms that are not yet in the datatable. Furthermore, the user can search for words in the search text field and download the datatable as CSV, Excel, or PDF.
