J Am Med Inform Assoc. 2004 Sep-Oct;11(5):392-402.
doi: 10.1197/jamia.M1552. Epub 2004 Jun 7.

Automated encoding of clinical documents based on natural language processing

Carol Friedman et al. J Am Med Inform Assoc. 2004 Sep-Oct.

Abstract

Objective: The aim of this study was to develop a method based on natural language processing (NLP) that automatically maps an entire clinical document to codes with modifiers and to quantitatively evaluate the method.

Methods: An existing NLP system, MedLEE, was adapted to automatically generate codes. The method involves matching of structured output generated by MedLEE consisting of findings and modifiers to obtain the most specific code. Recall and precision applied to Unified Medical Language System (UMLS) coding were evaluated in two separate studies. Recall was measured using a test set of 150 randomly selected sentences, which were processed using MedLEE. Results were compared with a reference standard determined manually by seven experts. Precision was measured using a second test set of 150 randomly selected sentences from which UMLS codes were automatically generated by the method and then validated by experts.
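The "most specific code" step described above can be illustrated with a minimal sketch. It assumes a coding table whose entries pair a finding with a set of modifiers, and it selects the matching entry that accounts for the most extracted modifiers. The names `TableEntry`, `most_specific_code`, the modifier labels, and the `CODE_...` identifiers are hypothetical placeholders, not MedLEE's actual data structures or real UMLS codes.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class TableEntry:
    code: str                              # placeholder identifier, not a real UMLS CUI
    finding: str                           # core finding the code describes
    modifiers: FrozenSet[Tuple[str, str]]  # (modifier type, value) pairs the code requires


def most_specific_code(
    finding: str, modifiers: Dict[str, str], table: List[TableEntry]
) -> Optional[str]:
    """Return the code whose required modifiers are all present in the
    extracted output and which requires the largest number of them."""
    extracted = frozenset(modifiers.items())
    candidates = [
        entry
        for entry in table
        if entry.finding == finding and entry.modifiers <= extracted
    ]
    if not candidates:
        return None
    # More required modifiers means a more specific concept.
    return max(candidates, key=lambda entry: len(entry.modifiers)).code


# Hypothetical table with a general and a more specific entry.
TABLE = [
    TableEntry("CODE_MI", "myocardial infarction", frozenset()),
    TableEntry(
        "CODE_ANTEROLATERAL_MI",
        "myocardial infarction",
        frozenset({("bodyloc", "anterolateral")}),
    ),
]

code = most_specific_code(
    "myocardial infarction",
    {"bodyloc": "anterolateral", "status": "past"},
    TABLE,
)
print(code)
```

In this sketch, modifiers that no table entry accounts for (here, the status) are simply left over; they could be kept alongside the code as related information.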

Results: Recall of the system for UMLS coding of all terms was .77 (95% CI .72-.81), and for coding terms that had corresponding UMLS codes, recall was .83 (.79-.87). Recall of the system for extracting all terms was .84 (.81-.88). Recall of the experts ranged from .69 to .91 for extracting terms. The precision of the system was .89 (.87-.91), and precision of the experts ranged from .61 to .91.
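Recall and precision are both proportions, so a point estimate with a 95% confidence interval can be sketched as below. The abstract does not state how its intervals were computed; the normal (Wald) approximation used here is an assumption, and the counts are illustrative rather than taken from the study. The helper name `proportion_with_ci` is hypothetical.

```python
import math
from typing import Tuple


def proportion_with_ci(successes: int, total: int, z: float = 1.96) -> Tuple[float, float, float]:
    """Return (estimate, lower, upper) for a proportion using the
    normal-approximation interval, clipped to [0, 1]."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Illustrative counts only: 80 of 100 reference-standard terms found.
# For recall, successes are true positives and total is true positives
# plus false negatives; for precision, total is true positives plus
# false positives.
recall, lower, upper = proportion_with_ci(80, 100)
print(f"recall = {recall:.2f} (95% CI {lower:.2f}-{upper:.2f})")
```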

Conclusion: Extraction of relevant clinical information and UMLS coding were accomplished using a method based on NLP. The method appeared to be comparable to or better than six experts. The advantage of the method is that it maps text to codes along with other related information, rendering the coded output suitable for effective retrieval.

Figures

Figure 1.
Overview of components of MedLEE. The knowledge-based components are shown as ovals; the processing engines are shown as rectangles. The new work discussed in this report involves the final stages of processing, the encoding process, which occurs after structured output is obtained.
Figure 2.
Process for creating a structured coding table of clinical information that is required for encoding of documents.
Figure 3.
Simplified form of the coding table. The actual table has additional fields to take advantage of indexing for improved efficiency when searching for matches. Several sample entries in the coding table are shown for some UMLS concepts associated with myocardial infarction. The first field consists of the CUI followed by the preferred form or variant, which is provided for ease of readability. The second field consists of the structure obtained by MedLEE as a result of parsing the input terms in the table creation phase.
Figure 4.
XML output generated by MedLEE for "He was status post an anterolateral myocardial infarction." Modifiers are indented in this figure for readability, but in the actual XML output form, they are not indented. The structured component is enclosed within the structured tag, and the original text is enclosed within the tt tag. In addition, the sentence and terms are also tagged.
Figure 5.
Screen shot of the graphical user interface of the application used by experts to evaluate the precision of MedLEE's UMLS coding. This view was obtained by clicking on the code C0745830 in the right window. As a result, the information in the report that was used to obtain the code is highlighted in the window on the left.
Figure 6.
The first four recall measures are for the system, and the remaining six (S1-S6) are for the six subjects. “Coded 1” represents the performance of MedLEE coding when considering all terms in the reference standard. “Coded 2” represents performance for coding terms in the reference standard that have corresponding UMLS codes. “Parsed 1” represents extraction performance (i.e., extracting the term from the text but not necessarily mapping it to a UMLS code) for all terms, and “Parsed 2” represents the extraction performance for terms that have existing UMLS codes.
