Automated encoding of clinical documents based on natural language processing

Carol Friedman¹, Lyudmila Shagina, Yves Lussier, George Hripcsak

PMID: 15187068
PMCID: PMC516246
DOI: 10.1197/jamia.M1552

Carol Friedman et al. J Am Med Inform Assoc. 2004 Sep-Oct;11(5):392-402.

doi: 10.1197/jamia.M1552. Epub 2004 Jun 7.

Affiliation

¹ Department of Biomedical Informatics, Columbia University, 622 West 168 Street, VC-5, New York, NY 10032, USA. friedman@dbmi.columbia.edu

Abstract

Objective: The aim of this study was to develop a method based on natural language processing (NLP) that automatically maps an entire clinical document to codes with modifiers and to quantitatively evaluate the method.

Methods: An existing NLP system, MedLEE, was adapted to automatically generate codes. The method involves matching of structured output generated by MedLEE consisting of findings and modifiers to obtain the most specific code. Recall and precision applied to Unified Medical Language System (UMLS) coding were evaluated in two separate studies. Recall was measured using a test set of 150 randomly selected sentences, which were processed using MedLEE. Results were compared with a reference standard determined manually by seven experts. Precision was measured using a second test set of 150 randomly selected sentences from which UMLS codes were automatically generated by the method and then validated by experts.
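
The matching step described above can be illustrated with a minimal sketch. It assumes a coding table in which each entry pairs a finding with a set of modifiers, and it treats "most specific" as "the matching entry with the most modifiers". The table contents, the code identifiers (`CODE_MI` and so on, which are placeholders and not real UMLS codes), and the function name are all hypothetical; this is not MedLEE's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    code: str             # placeholder identifier, NOT a real UMLS code
    finding: str          # the central finding the code describes
    modifiers: frozenset  # (modifier_type, value) pairs the code implies


# Hypothetical coding table with a general and a more specific entry.
CODING_TABLE = [
    Entry("CODE_MI", "myocardial infarction", frozenset()),
    Entry("CODE_MI_ANTEROLATERAL", "myocardial infarction",
          frozenset({("bodyloc", "anterolateral")})),
    Entry("CODE_PNEUMONIA", "pneumonia", frozenset()),
]


def most_specific_code(finding, modifiers):
    """Return (code, leftover_modifiers), or None when nothing matches.

    An entry matches when its finding is equal to the extracted finding and
    all of its modifiers are present in the extracted output. Among the
    matches, the entry with the most modifiers is taken as most specific.
    """
    extracted = frozenset(modifiers.items())
    candidates = [
        e for e in CODING_TABLE
        if e.finding == finding and e.modifiers <= extracted
    ]
    if not candidates:
        return None
    best = max(candidates, key=lambda e: len(e.modifiers))
    # Modifiers not absorbed by the code stay attached to the coded output.
    leftover = dict(extracted - best.modifiers)
    return best.code, leftover


result = most_specific_code(
    "myocardial infarction",
    {"bodyloc": "anterolateral", "status": "status post"},
)
print(result)
```

In this sketch the modifiers that the chosen code does not absorb are returned next to it, which mirrors the idea of mapping text to codes together with other related information.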

Results: Recall of the system for UMLS coding of all terms was .77 (95% CI .72-.81), and for coding terms that had corresponding UMLS codes recall was .83 (.79-.87). Recall of the system for extracting all terms was .84 (.81-.88). Recall of the experts ranged from .69 to .91 for extracting terms. The precision of the system was .89 (.87-.91), and precision of the experts ranged from .61 to .91.
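
For readers less familiar with these measures, the sketch below shows how recall, precision, and a 95% confidence interval for a proportion could be computed. The abstract does not state how its intervals were derived, so the Wilson score interval used here is an assumption, and the example sets are made up purely for illustration.

```python
import math


def recall(system_terms, reference_terms):
    """Fraction of reference-standard terms that the system also produced.

    Assumes reference_terms is non-empty.
    """
    return len(system_terms & reference_terms) / len(reference_terms)


def precision(system_codes, correct_codes):
    """Fraction of system-generated codes that reviewers judged correct.

    Assumes system_codes is non-empty.
    """
    return len(system_codes & correct_codes) / len(system_codes)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Made-up example data, not taken from the study.
reference = {"fever", "cough", "pneumonia", "effusion"}
system = {"fever", "cough", "pneumonia", "edema"}
judged_correct = {"fever", "cough", "pneumonia"}

print(recall(system, reference))
print(precision(system, judged_correct))
print(wilson_interval(3, 4))
```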

Conclusion: Extraction of relevant clinical information and UMLS coding were accomplished using a method based on NLP. The method appeared to be comparable to or better than six experts. The advantage of the method is that it maps text to codes along with other related information, rendering the coded output suitable for effective retrieval.

Figures

**Figure 1.**
Overview of components of MedLEE. The knowledge-based components are shown as ovals; the processing engines are shown as rectangles. The new work discussed in this report involves the final stages of processing, the encoding process, which occurs after structured output is obtained.

**Figure 2.**
Process for creating a structured coding table of clinical information that is required for encoding of documents.

**Figure 3.**
Simplified form of the coding table. The actual table has additional fields to take advantage of indexing for improved efficiency when searching for matches. Several sample entries in the coding table are shown for some UMLS concepts associated with myocardial infarction. The first field consists of the CUI followed by the preferred form or variant, which is provided for ease of readability. The second field consists of the structure obtained by MedLEE as a result of parsing the input terms in the table creation phase.

**Figure 4.**
XML output generated by MedLEE for *He was status post an anterolateral myocardial infarction*. Modifiers are indented in this figure for readability, but in the actual XML output form, they are not indented. The structured component is enclosed within the **structured** tag, and the original text is enclosed within the tt tag. In addition, the sentence and terms are also tagged.
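
A rough sketch of how such output might be read programmatically follows. Only the `structured` and `tt` tag names come from the caption above; the wrapper element, the `problem`, `bodyloc`, `status`, and `sent` elements, and the `v` attribute are hypothetical stand-ins chosen for illustration and may not match MedLEE's real schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: only <structured> and <tt> are named in the caption.
SAMPLE = """
<section>
  <structured>
    <problem v="myocardial infarction">
      <bodyloc v="anterolateral"/>
      <status v="status post"/>
    </problem>
  </structured>
  <tt>
    <sent id="s1">He was status post an anterolateral myocardial infarction</sent>
  </tt>
</section>
""".strip()


def findings_with_modifiers(xml_text):
    """Return (finding_type, value, modifiers) for each structured finding."""
    root = ET.fromstring(xml_text)
    structured = root.find("structured")
    results = []
    for finding in structured:
        # Each child element of a finding is treated as one modifier.
        modifiers = {child.tag: child.get("v") for child in finding}
        results.append((finding.tag, finding.get("v"), modifiers))
    return results


def original_text(xml_text):
    """Return the tagged original text with the markup stripped."""
    root = ET.fromstring(xml_text)
    return "".join(root.find("tt").itertext()).strip()


print(findings_with_modifiers(SAMPLE))
print(original_text(SAMPLE))
```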

**Figure 5.**
Screenshot of the graphical user interface of the application used by experts to evaluate the precision of MedLEE's UMLS coding. This view was obtained by clicking on the code **C0745830** in the right window. As a result, the information in the report that was used to obtain the code is highlighted in the window on the left.

**Figure 6.**
The first four recall measures are for the system, and the remaining six (S1-S6) are for the six subjects. “Coded 1” represents the performance of MedLEE coding when considering all terms in the reference standard. “Coded 2” represents performance for coding terms in the reference standard that have corresponding UMLS codes. “Parsed 1” represents extraction performance (i.e., extracting the term from the text but not necessarily mapping it to a UMLS code) for all terms; “Parsed 2” represents the extraction performance for terms that have existing UMLS codes.

