JMIRx Med. 2021 Aug 11;2(3):e27017.
doi: 10.2196/27017.

Finding Potential Adverse Events in the Unstructured Text of Electronic Health Care Records: Development of the Shakespeare Method

Roselie A Bright et al. JMIRx Med.

Abstract

Background: Big data tools provide opportunities to monitor adverse events (AEs; patient harm associated with medical care) in the unstructured text of electronic health care records (EHRs). Writers may explicitly state an apparent association between treatment and adverse outcome ("attributed") or simply state the treatment and outcome without an association ("unattributed"). Many methods for finding AEs in text rely on predefining possible AEs before searching for prespecified words and phrases, or on manual labeling (standardization) by investigators. We developed a method to identify possible AEs, even if unknown or unattributed, without any prespecifications or standardization of notes. Our method was inspired by word-frequency analysis methods used to uncover the true authorship of disputed works credited to William Shakespeare. We chose two use cases, "transfusion" and "time-based." Transfusion was chosen because new transfusion AE types were becoming recognized during the study data period; therefore, we anticipated an opportunity to find unattributed potential AEs (PAEs) in the notes. With the time-based case, we wanted to simulate near real-time surveillance. We chose time periods in the hope of detecting PAEs due to the heparin contamination that occurred from mid-2007 to mid-2008 and was announced in early 2008. We hypothesized that contaminated heparin may have been prevalent enough to manifest in EHRs through symptoms related to heparin AEs, independent of clinicians' documentation of attributed AEs.

Objective: We aimed to develop a new method to identify attributed and unattributed PAEs using the unstructured text of EHRs.

Methods: We used EHRs for adult critical care admissions at a major teaching hospital (2001-2012). For each case, we formed a group of interest and a comparison group. We concatenated the text notes for each admission into one document sorted by date, and deleted replicate sentences and lists. We identified statistically significant words in the group of interest versus the comparison group. Documents in the group of interest were filtered to those words, followed by topic modeling on the filtered documents to produce topics. For each topic, the three documents with the maximum topic scores were manually reviewed to identify PAEs.
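The steps in the Methods paragraph can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not name the significance test, so the 2x2 chi-square test on document frequencies, the restriction to over-represented words, and all function names here are assumptions made for illustration. The topic model itself is left out: `top_documents` assumes a document-by-topic score matrix has already been produced by some topic model.

```python
import re
from collections import Counter


def build_document(notes):
    """Concatenate one admission's (date, text) notes in date order,
    keeping only the first occurrence of each sentence."""
    seen, kept = set(), []
    for _, text in sorted(notes, key=lambda note: note[0]):
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            key = sentence.lower()
            if sentence and key not in seen:
                seen.add(key)
                kept.append(sentence)
    return " ".join(kept)


def tokenize(doc):
    """Lowercase alphabetic tokens (a simplifying assumption)."""
    return re.findall(r"[a-z]+", doc.lower())


def doc_frequency(docs):
    """Count the documents in which each word appears at least once."""
    counts = Counter()
    for doc in docs:
        counts.update(set(tokenize(doc)))
    return counts


def significant_terms(interest_docs, comparison_docs, critical=3.841):
    """Words over-represented in the group of interest.

    Assumed test: 2x2 chi-square on document frequency; 3.841 is the
    critical value for 1 degree of freedom at alpha = 0.05."""
    n1, n2 = len(interest_docs), len(comparison_docs)
    df1, df2 = doc_frequency(interest_docs), doc_frequency(comparison_docs)
    selected = set()
    for word, a in df1.items():
        b = df2.get(word, 0)          # comparison documents with the word
        c, d = n1 - a, n2 - b         # documents without the word
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        if denom == 0:                # word in every document, or empty group
            continue
        chi2 = (n1 + n2) * (a * d - b * c) ** 2 / denom
        if chi2 > critical and a / n1 > b / n2:
            selected.add(word)
    return selected


def filter_document(doc, terms):
    """Keep only the tokens that are in the selected term set."""
    return " ".join(word for word in tokenize(doc) if word in terms)


def top_documents(doc_topic_scores, k=3):
    """For each topic, indexes of the k documents with the highest score."""
    n_topics = len(doc_topic_scores[0])
    return {
        topic: sorted(range(len(doc_topic_scores)),
                      key=lambda i: doc_topic_scores[i][topic],
                      reverse=True)[:k]
        for topic in range(n_topics)
    }
```

In this sketch, each document in the group of interest would be passed through `filter_document` before topic modeling, and the indexes returned by `top_documents` would identify the documents to review by hand.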

Results: Topics centered around medical conditions that were unique to or more common in the group of interest, including PAEs. In each use case, most PAEs were unattributed in the notes. Among the transfusion PAEs was unattributed evidence of transfusion-associated cardiac overload and transfusion-related acute lung injury. Some of the PAEs from mid-2007 to mid-2008 were increased unattributed events consistent with AEs related to heparin contamination.

Conclusions: The Shakespeare method could be a useful supplement to AE reporting and surveillance of structured EHR data. Future improvements should include automation of the manual review process.

Keywords: big data; critical care; electronic health care record; electronic health record; epidemiology; natural language processing; patient harm; patient safety; product surveillance, postmarketing; proof-of-concept study; public health.

Conflict of interest statement

Conflicts of Interest: The research was done with FDA support and under contract HHSF223201510027B between the FDA and Booz Allen Hamilton Inc. None of the authors have other relevant financial interests. The opinions presented in this paper are those of the authors and do not represent official policy of either the FDA or Booz Allen Hamilton.

Figures

Figure 1
The Shakespeare method process with truncated examples. Step 1 (create n-gram vectors) includes (A) n-grams (terms) and (B) form vectors. Step 2 (create two groups) is (C) form groups. Step 3 (extract significant terms) is (D) extracted terms and (E) trim vectors in the group of interest. Step 4 (model topics) includes (F) latent Dirichlet allocation (LDA) topic modeling and (G) topics to documents. Step 5 (review topics) includes (H) identification of exceptional instances.
Figure 2
Flowchart of the embedded-based and filter-based term selection processes for the transfusion case. T: transfusion, C: comparison.
Figure 3
Topic-modeling results for the transfusion case (T): (A) distribution of all maximum document topic scores for all T, (B) documents that have only one strong topic, (C) documents that have many topics, (D) all topic scores for a single document that has multiple topics, and (E) two documents with a score of 0.022 for every topic.
Figure 4
Distribution of topic document scores and top term scores for the transfusion case.
Figure 5
Distribution of document topic scores for two topics in the transfusion case: (a) topic 8, a noncoherent topic, and (b) topic 42, a coherent topic.
Figure 6
Feature extraction flowchart for the time-based case. This demonstrates the two parallel processes for extracting relevant features prior to topic modeling on the notes: term frequency analysis and binary classification of notes.
Figure 7
Heparin and hypotension for the time-based case (see Table S4 [Multimedia Appendix 1] for search criteria details). (A) Invasive cardiology-, heparin-, and hypotension-related criteria as a proportion of all admissions. Invasive cardiology is presumed to involve heparin treatment. For invasive cardiovascular procedure code, slope=–0.0053 (95% CI –0.0069 to –0.0037), P<.001; for heparin word, slope=0.0039 (95% CI 0.0025-0.0054), P<.001; and for hypotension word, slope=0.0029 (95% CI 0.0017-0.0040), P<.001. (B) The word “hypotension” as a proportion of presumed heparin exposure. For the proportion of any invasive cardiovascular procedure code (presumed to involve heparin), slope=0.0055 (95% CI 0.0038-0.0072), P<.001. For the proportion of those with “heparin,” slope=0.0013 (95% CI –0.00036 to 0.0030), P=.12.
Figure 8
Trauma code, word, or both as a proportion of all admissions by quarter for the time-based case (see Table S4 [Multimedia Appendix 1] for search criteria details). For the proportion of trauma code, slope=0.0022 (95% CI 0.0014-0.0030), P<.001. For the proportion of the word “trauma,” slope=0.0057 (95% CI 0.0047-0.0067), P<.001. For the proportion with both trauma code and word, slope=0.0019 (95% CI 0.0012-0.0027), P<.001.
Figure 9
Brain ischemia codes or text words for (A) bleeding, (B) ischemia, and (C) trauma, as a proportion of all admissions by quarter for the time-based case (see Table S4 [Multimedia Appendix 1] for search criteria details). For brain bleed code, slope=0.00022 (95% CI –0.0006 to 0.0010), P=.61. For brain word and brain bleed word, slope=0.00039 (95% CI 0-0.00085), P=.10. For brain ischemia code, slope=0.00019 (95% CI 0.00051-0.0013), P<.001. For brain word and “occlusion*,” slope=0 (95% CI –0.00064 to 0.00080), P=.84. For brain trauma code, slope=0.0013 (95% CI 0.00073-0.0018), P<.001. For brain word and “trauma,” slope=0.0021 (95% CI 0.0014-0.0028), P<.001.
Figure 10
Excess draining from postsurgical wounds as a proportion of all admissions by quarter for the time-based case (see Table S4 [Multimedia Appendix 1] for search criteria details). For leaky surgical wound code, slope=0.000027 (95% CI –0.000028 to 0.000082), P=.34. For leaky surgical wound word and long stay, slope=0.0018 (95% CI 0.0012-0.0024), P<.001. For wound catheter word and long stay, slope=0.00038 (95% CI –0.00039 to 0.0012), P=.34. For leaky surgical wound word and wound catheter word and long stay, slope=0.0011 (95% CI 0.00071-0.0016), P<.001.
Figure 11
Allergy, anaphylaxis, and adverse effect (AE) as a proportion of admissions by quarter for the time-based case (see Table S4 [Multimedia Appendix 1] for search criteria details). For allergy or anaphylaxis word, slope=–0.0022 (95% CI –0.0027 to –0.0018), P<.001. For drug AE code, slope=0.00031 (95% CI –0.000079 to 0.00070), P=.12. For surgery or medical AE code, slope=0.00049 (95% CI –0.00022 to 0.0012), P=.18.
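
The captions for Figures 7 to 11 report a slope with a 95% CI for each quarterly proportion, but they do not say how the slopes were fitted. The following is a minimal sketch of one plausible reading, assuming an ordinary least squares fit of proportion against quarter index and a normal-approximation interval. The function name `slope_with_ci` and both of those choices are assumptions for illustration, not the authors' stated procedure.

```python
from statistics import NormalDist, mean


def slope_with_ci(proportions, level=0.95):
    """Least-squares slope of quarterly proportions against quarter index
    (0, 1, 2, ...), with a normal-approximation confidence interval.

    Assumption: a t-based interval would be slightly wider for short series.
    """
    n = len(proportions)
    if n < 3:
        raise ValueError("need at least three quarters to estimate an interval")
    quarters = range(n)
    x_bar, y_bar = mean(quarters), mean(proportions)
    sxx = sum((x - x_bar) ** 2 for x in quarters)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(quarters, proportions))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual sum of squares around the fitted line.
    sse = sum((y - (intercept + slope * x)) ** 2
              for x, y in zip(quarters, proportions))
    std_err = (sse / (n - 2) / sxx) ** 0.5
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return slope, slope - z * std_err, slope + z * std_err
```

Under these assumptions, an interval that excludes zero corresponds to the captions' small P values, and an interval that spans zero corresponds to slopes reported as not significant.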

Update of

  • https://www.medrxiv.org/content/10.1101/2021.01.05.21249239v1
  • JMIRx Med. 2:e27017.

