Using topic modelling for unsupervised annotation of electronic health records to identify an outbreak of disease in UK dogs
- PMID: 34882714
- PMCID: PMC8659617
- DOI: 10.1371/journal.pone.0260402
Abstract
A key goal of disease surveillance is to identify outbreaks of known or novel diseases in a timely manner. Such an outbreak, associated with acute vomiting in dogs, occurred in the UK between December 2019 and March 2020. We tracked this outbreak using the clinical free text component of anonymised electronic health records (EHRs) collected from a sentinel network of participating veterinary practices. We sourced the free text (narrative) component of each EHR supplemented with one of 10 practitioner-derived main presenting complaints (MPCs), with the 'gastroenteric' MPC identifying cases involved in the disease outbreak. Such clinician-derived annotation systems can suffer from poor compliance requiring retrospective, often manual, coding, thereby limiting real-time usability, especially where an outbreak of a novel disease might not present clinically as a currently recognised syndrome or MPC. Here, we investigate the use of an unsupervised method of EHR annotation using latent Dirichlet allocation (LDA) topic modelling to identify topics inherent within the clinical narrative component of EHRs. The model comprised 30 topics, which were used to annotate EHRs spanning the natural disease outbreak and to investigate whether any given topic might mirror the outbreak time-course. Each narrative was annotated, using the LdaModel module of the Gensim library, with the topic best representing its text. Counts for narratives labelled with one of the topics significantly matched the disease outbreak based on the practitioner-derived 'gastroenteric' MPC (Spearman correlation 0.978); no other topics showed a similar time course. Using artificially injected outbreaks, it was possible to see other topics that would match other MPCs, including respiratory disease. The underlying topics were readily evaluated using simple word-cloud representations and a freely available package (LDAVis), providing rapid insight into the clinical basis of each topic.
This work clearly shows that unsupervised record annotation using topic modelling linked to simple text visualisations can provide an easily interrogable method to identify and characterise outbreaks and other anomalies of known and previously un-characterised diseases based on changes in clinical narratives.
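As a rough illustration of the annotation and comparison steps described above, the following minimal sketch labels each narrative with its highest-probability topic, counts labelled narratives per week, and compares that time series with an MPC count series using Spearman rank correlation. It is not the authors' code: the function names, the three-topic toy distributions, and the weekly MPC counts are all hypothetical, and the per-narrative topic probabilities are assumed to have been produced already by a fitted LDA model.

```python
from collections import Counter


def dominant_topic(topic_probs):
    """Return the index of the highest-probability topic for one narrative."""
    return max(range(len(topic_probs)), key=lambda k: topic_probs[k])


def weekly_counts(records, topic, n_weeks):
    """Count, per week, the narratives whose dominant topic equals `topic`.

    `records` is a list of (week_index, topic_probabilities) pairs.
    """
    counts = Counter(
        week for week, probs in records if dominant_topic(probs) == topic
    )
    return [counts.get(week, 0) for week in range(n_weeks)]


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy data: (week, topic distribution) for a 3-topic model.
records = [
    (0, [0.7, 0.2, 0.1]), (0, [0.1, 0.8, 0.1]),
    (1, [0.6, 0.3, 0.1]), (1, [0.5, 0.4, 0.1]), (1, [0.2, 0.2, 0.6]),
    (2, [0.8, 0.1, 0.1]), (2, [0.6, 0.2, 0.2]), (2, [0.5, 0.3, 0.2]),
    (2, [0.1, 0.7, 0.2]),
    (3, [0.9, 0.05, 0.05]),
]
# Hypothetical weekly counts of records carrying the 'gastroenteric' MPC.
mpc_counts = [2, 5, 9, 3]

topic_counts = weekly_counts(records, topic=0, n_weeks=4)
rho = spearman(topic_counts, mpc_counts)
print(topic_counts, round(rho, 3))
```

In practice the comparison would be repeated for every topic in the model, and the topic whose series tracks the MPC series most closely would then be inspected through its top words.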
Conflict of interest statement
The authors have declared that no competing interests exist.