PLoS One. 2021 Dec 9;16(12):e0260402.
doi: 10.1371/journal.pone.0260402. eCollection 2021.

Using topic modelling for unsupervised annotation of electronic health records to identify an outbreak of disease in UK dogs

Peter-John Mäntylä Noble et al. PLoS One. 2021.

Abstract

A key goal of disease surveillance is to identify outbreaks of known or novel diseases in a timely manner. Such an outbreak occurred in the UK associated with acute vomiting in dogs between December 2019 and March 2020. We tracked this outbreak using the clinical free text component of anonymised electronic health records (EHRs) collected from a sentinel network of participating veterinary practices. We sourced the free text (narrative) component of each EHR supplemented with one of 10 practitioner-derived main presenting complaints (MPCs), with the 'gastroenteric' MPC identifying cases involved in the disease outbreak. Such clinician-derived annotation systems can suffer from poor compliance requiring retrospective, often manual, coding, thereby limiting real-time usability, especially where an outbreak of a novel disease might not present clinically as a currently recognised syndrome or MPC. Here, we investigate the use of an unsupervised method of EHR annotation using latent Dirichlet allocation topic-modelling to identify topics inherent within the clinical narrative component of EHRs. The model comprised 30 topics which were used to annotate EHRs spanning the natural disease outbreak and investigate whether any given topic might mirror the outbreak time-course. Narratives were annotated using the Gensim Library LdaModel module for the topic best representing the text within them. Counts for narratives labelled with one of the topics significantly matched the disease outbreak based on the practitioner-derived 'gastroenteric' MPC (Spearman correlation 0.978); no other topics showed a similar time course. Using artificially injected outbreaks, it was possible to see other topics that would match other MPCs including respiratory disease. The underlying topics were readily evaluated using simple word-cloud representations and using a freely available package (LDAVis) providing rapid insight into the clinical basis of each topic. 
This work clearly shows that unsupervised record annotation using topic modelling linked to simple text visualisations can provide an easily interrogable method to identify and characterise outbreaks and other anomalies of known and previously un-characterised diseases based on changes in clinical narratives.
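The annotation and comparison steps described above (label each narrative with its best-matching topic, count labelled narratives per day, then correlate those counts with the practitioner-derived MPC counts) can be sketched as follows. This is a minimal, standard-library illustration and not the authors' code: it assumes each narrative already has a list of `(topic_id, probability)` pairs, the shape returned by Gensim's `LdaModel.get_document_topics`, and the function names and toy records are hypothetical.

```python
from collections import Counter


def dominant_topic(topic_probs):
    """Return the topic id with the highest probability for one narrative."""
    return max(topic_probs, key=lambda pair: pair[1])[0]


def daily_counts(records, topic_id, days):
    """Count narratives per day whose dominant topic is `topic_id`.

    `records` is an iterable of (day, topic_probs) pairs (assumed layout).
    Days with no matching narrative get a count of 0.
    """
    counts = Counter(
        day for day, probs in records if dominant_topic(probs) == topic_id
    )
    return [counts.get(day, 0) for day in days]


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors.

    Raises ZeroDivisionError if either series is constant.
    """
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy data, invented purely for illustration.
records = [
    ("2020-01-01", [(17, 0.9), (6, 0.1)]),
    ("2020-01-01", [(6, 0.8), (17, 0.2)]),
    ("2020-01-02", [(17, 0.6), (6, 0.4)]),
]
days = ["2020-01-01", "2020-01-02", "2020-01-03"]
topic_series = daily_counts(records, topic_id=17, days=days)
```

In practice the topic series would be compared with the daily 'gastroenteric' MPC counts over the same days via `spearman(topic_series, mpc_series)`; the toy series here is far too short to say anything about the reported correlation.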

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. SAVSNET MPC window.
At the end of each consultation, veterinary clinicians are required to annotate the health record selecting from a list of syndromes or main presenting complaints (MPCs) as shown.
Fig 2. Count and topic distribution for different MPCs amongst 3,499,566 veterinary electronic health records.
(A) Proportion of narratives belonging to each MPC classified as relating to each topic. (B) Adjusted mutual information score for topics with the indicated MPCs, showing topics that had at least one adjusted mutual information score greater than 0.025.
Fig 3. The effect of injected outbreaks on topic counts.
Simulated outbreaks were injected for pruritus, trauma, gastroenteric, tumour, respiratory and kidney disease, with signals visible for gastroenteric MPC (topic_17), respiratory MPC (topic_6) and trauma MPC (topic_26).
Fig 4. Comparison of topic counts vs MPC counts over the course of the natural disease outbreak (November 2019 to March 2020).
(A) Daily counts of narratives matching individual topics (where the average daily consult count exceeds 10), plotted as a rolling 7-day average. (B) Counts for narratives annotated by topics and selected MPCs, normalised to the average count for that topic, plotted as a rolling 7-day average. Only topic 17 counts match the gastroenteric MPC over the course of the natural disease outbreak.
Fig 5. Comparison of topic 17 and gastroenteric MPC narrative counts.
(A) Comparison of frequency distributions of topic 17 narratives and gastroenteric MPC narratives in the year preceding the outbreak. (B) Comparison of frequency distributions of topic 17 narratives and gastroenteric MPC narratives during the outbreak. (C) and (D) Temporal pattern of topic 17 (C) and gastroenteric MPC (D) narrative counts. Red points represent the extreme outliers (outside the 99 per cent credible interval [CI]), orange points represent the moderate outliers (outside the 95 per cent CI but within the 99 per cent CI), and green points represent the average trend (within the 95 per cent CI).
Fig 6. Analysis of topic content.
(A) Topic representations as word clouds. (B) Analysis using LDAVis of topic separation and frequency within the corpus (left panel) and of the key terms contributing to topic 17, which include words and abbreviations related to gastrointestinal function (right panel).
Fig 7. Dynamic topic modelling.
The evolution of token-weighting within the gastroenteric topic showed increasing weighting for vomiting-related terms.
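
The smoothing and normalisation named in the Fig 4 caption (a rolling 7-day average, and counts normalised to the average count for that topic) can be illustrated with a short sketch. The caption does not say whether the window is trailing or centred, nor in which order the two steps were applied, so the trailing window and the function names below are assumptions made for illustration only.

```python
def rolling_mean(counts, window=7):
    """Trailing rolling mean over `window` days.

    Positions without a full window of history are returned as None
    (an assumption; a centred window would shift the curve).
    """
    out = []
    running = 0.0
    for i, c in enumerate(counts):
        running += c
        if i >= window:
            # Drop the value that has just left the window.
            running -= counts[i - window]
        out.append(running / window if i >= window - 1 else None)
    return out


def normalise_to_mean(counts):
    """Divide each daily count by the series mean, so 1.0 is an average day."""
    mean = sum(counts) / len(counts)
    return [c / mean for c in counts]


# Invented daily counts for one topic; not data from the study.
toy_counts = [12, 15, 11, 14, 30, 42, 38, 35, 20, 13]
smoothed = rolling_mean(normalise_to_mean(toy_counts))
```

Normalising before smoothing puts topics and MPCs with very different daily volumes on a common scale, which is what allows their time courses to be overlaid on one axis.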
