Validation of a Natural Language Processing Algorithm for Detecting Infectious Disease Symptoms in Primary Care Electronic Medical Records in Singapore

Antony Hardjojo¹, Arunan Gunachandran¹, Long Pang¹, Mohammed Ridzwan Bin Abdullah¹, Win Wah¹, Joash Wen Chen Chong¹, Ee Hui Goh¹, Sok Huang Teo², Gilbert Lim³, Mong Li Lee³, Wynne Hsu³, Vernon Lee¹, Mark I-Cheng Chen¹,⁴, Franco Wong²,⁵, Jonathan Siung King Phang²,⁵

Affiliations

¹ Saw Swee Hock School of Public Health, National University Health System, National University of Singapore, Singapore, Singapore.
² National Healthcare Group Polyclinics, Singapore, Singapore.
³ School of Computing, National University of Singapore, Singapore, Singapore.
⁴ National Centre for Infectious Diseases, Singapore, Singapore.
⁵ National University Polyclinics, Singapore, Singapore.

PMID: 29907560
PMCID: PMC6026305
DOI: 10.2196/medinform.8204

Antony Hardjojo et al. JMIR Med Inform. 2018 Jun 11;6(2):e36. doi: 10.2196/medinform.8204.

Abstract

Background: Free-text clinical records provide a source of information that complements traditional disease surveillance. To electronically harness these records, they need to be transformed into codified fields by natural language processing algorithms.

Objective: The aim of this study was to develop, train, and validate the Clinical History Extractor for Syndromic Surveillance (CHESS), a natural language processing algorithm that extracts clinical information from free-text primary care records.

Methods: CHESS is a keyword-based natural language processing algorithm that extracts 48 signs and symptoms, covering those suggesting respiratory infections or gastrointestinal infections, constitutional symptoms, and other signs and symptoms potentially associated with infectious diseases. The algorithm also captures the assertion status (affirmed, negated, or suspected) and symptom duration. Electronic medical records from the National Healthcare Group Polyclinics, a major public sector primary care provider in Singapore, were randomly extracted and manually reviewed by 2 human reviewers, with a third reviewer as the adjudicator. The algorithm was evaluated on 1680 notes against the human-coded result as the reference standard, with half of the data used for training and the other half for validation.
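
To make the idea of keyword-based extraction with assertion status and duration concrete, the following is a minimal Python sketch. It is not the CHESS implementation: the lexicon, the cue words, the phrase-splitting rule, and the function name `extract` are all assumptions chosen for illustration, and the real algorithm uses a much larger ontology and a grammar-based analysis that this abstract does not specify.

```python
import re

# Hypothetical mini-lexicon for illustration; CHESS itself covers 48 signs and symptoms.
SYMPTOMS = {
    "cough": ["cough"],
    "sore throat": ["sore throat", "st"],
    "fever": ["fever", "febrile"],
    "rhinorrhea": ["rhinorrhea", "runny nose"],
}
NEGATION_CUES = ("no", "nil", "denies", "without")    # assumed cue words
SUSPICION_CUES = ("possible", "suspected", "likely")  # assumed cue words
DURATION = re.compile(r"(\d+)\s*(day|week|month)s?\b")
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30}    # a month is approximated as 30 days


def extract(note):
    """Return a list of (symptom, assertion_status, duration_in_days) tuples."""
    findings = []
    # Treat each comma-, semicolon-, or period-delimited phrase independently.
    for phrase in re.split(r"[,;.]", note.lower()):
        # A duration mentioned anywhere in the phrase applies to its symptoms.
        duration_match = DURATION.search(phrase)
        days = None
        if duration_match:
            days = int(duration_match.group(1)) * DAYS_PER_UNIT[duration_match.group(2)]
        for symptom, keywords in SYMPTOMS.items():
            pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
            hit = re.search(pattern, phrase)
            if hit is None:
                continue
            # Only cue words that precede the keyword affect its assertion status.
            preceding = re.findall(r"[a-z]+", phrase[:hit.start()])
            if any(cue in preceding for cue in NEGATION_CUES):
                status = "negated"
            elif any(cue in preceding for cue in SUSPICION_CUES):
                status = "suspected"
            else:
                status = "affirmed"
            findings.append((symptom, status, days))
    return findings
```

For the made-up note `"C/o cough 3 days, no fever. Possible ST for 1 week"`, this sketch returns `[("cough", "affirmed", 3), ("fever", "negated", None), ("sore throat", "suspected", 7)]`. Splitting on punctuation is a deliberately crude stand-in for relating cue words to symptoms; it would mishandle, for example, decimal temperatures or negations that span several symptoms.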

Results: The symptoms most commonly present within the 1680 clinical records at the episode level were those typically present in respiratory infections such as cough (744/7703, 9.66%), sore throat (591/7703, 7.67%), rhinorrhea (552/7703, 7.17%), and fever (928/7703, 12.04%). At the episode level, CHESS had an overall performance of 96.7% precision and 97.6% recall on the training dataset and 96.0% precision and 93.1% recall on the validation dataset. Symptoms suggesting respiratory and gastrointestinal infections were all detected with more than 90% precision and recall. CHESS correctly assigned the assertion status in 97.3%, 97.9%, and 89.8% of affirmed, negated, and suspected signs and symptoms, respectively (97.6% overall accuracy). Symptom episode duration was correctly identified in 81.2% of records with known duration status.
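
The precision and recall figures above follow the standard definitions, which can be sketched as below. This is an illustration under assumptions: the function name `precision_recall` and the toy data are hypothetical, and the abstract does not describe the study's exact scoring procedure beyond comparison against the human-coded reference standard.

```python
def precision_recall(predicted, reference):
    """Compare two sets of (record_id, symptom) pairs.

    precision = TP / (TP + FP); recall = TP / (TP + FN).
    """
    true_pos = len(predicted & reference)    # found by both algorithm and humans
    false_pos = len(predicted - reference)   # found by the algorithm only
    false_neg = len(reference - predicted)   # found by humans only
    precision = true_pos / (true_pos + false_pos) if predicted else 0.0
    recall = true_pos / (true_pos + false_neg) if reference else 0.0
    return precision, recall


# Hypothetical toy data, not taken from the study.
reference = {(1, "cough"), (1, "fever"), (2, "sore throat"), (3, "cough")}
predicted = {(1, "cough"), (2, "sore throat"), (3, "cough"), (3, "fever")}
precision, recall = precision_recall(predicted, reference)
```

In this toy case three of the four predicted pairs are correct and three of the four reference pairs are found, so both precision and recall are 0.75.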

Conclusions: We have developed a natural language processing algorithm, dubbed CHESS, that achieves good performance in extracting signs and symptoms from primary care free-text clinical records. In addition to the presence of symptoms, our algorithm can also accurately distinguish affirmed, negated, and suspected assertion statuses and extract symptom durations.

Keywords: communicable diseases; electronic health records; epidemiology; natural language processing; surveillance; syndromic surveillance.

©Antony Hardjojo, Arunan Gunachandran, Long Pang, Mohammed Ridzwan Bin Abdullah, Win Wah, Joash Wen Chen Chong, Ee Hui Goh, Sok Huang Teo, Gilbert Lim, Mong Li Lee, Wynne Hsu, Vernon Lee, Mark I-Cheng Chen, Franco Wong, Jonathan Siung King Phang. Originally published in JMIR Medical Informatics (http://medinform.jmir.org), 11.06.2018.

Conflict of interest statement

Conflicts of Interest: None declared.

Figures

**Figure 1**
Ontology and grammar-based analysis of the rule-based natural language processing (NLP) algorithm. Signs and symptoms and information on assertion status and duration are captured and tokenized in the ontology analysis. Relationships between tokens are built up in the grammar-based analysis. C/o: complains of; ST: sore throat.

**Figure 2**
Sample set of clinical notes and transformation following phrase-level manual coding and episode-level coding. Abd: abdominal; NA: not applicable; NKDA: no known drug allergy; PMHX: past medical history; RIF: right iliac fossa.

**Figure 3**
Flowchart of process for creating reference standard.

**Figure 4**
Bubble chart of the Clinical History Extractor for Syndromic Surveillance’s (CHESS’s) precision and recall for each sign and symptom in episode level analysis for the validation dataset. Each bubble denotes a single symptom categorized into symptom types: respiratory, gastrointestinal, constitutional, and others. Bubble size is proportional to the number of cases identified by humans (true positive + false negative). Symptoms present in less than 1% of records are not presented.

**Figure 5**
Clinical History Extractor for Syndromic Surveillance’s (CHESS’s) accuracy in identifying assertion status of symptoms within episode level analysis based on the validation dataset.

**Figure 6**
Episode level analysis on the distribution of symptom episode duration in instances detected by human coders (blue) among all the National Healthcare Group Polyclinics (NHGP) records and the distribution of durations detected by Clinical History Extractor for Syndromic Surveillance (CHESS; red) based on the validation dataset. Diamonds give the proportion of records where CHESS correctly identifies and assigns the duration information stratified by episode duration (based on the reference standard), with the horizontal line giving the aggregated accuracy for detection of symptom duration for all records analyzed.
