J Am Med Inform Assoc. 1999 Jan-Feb;6(1):76-87.
doi: 10.1136/jamia.1999.0060076.

Representing information in patient reports using natural language processing and the extensible markup language

C Friedman et al. J Am Med Inform Assoc. 1999 Jan-Feb.

Abstract

Objective: To design a document model that provides reliable and efficient access to clinical information in patient reports for a broad range of clinical applications, and to implement an automated method using natural language processing that maps textual reports to a form consistent with the model.

Methods: A document model that encodes structured clinical information in patient reports while retaining the original contents was designed using the extensible markup language (XML), and a document type definition (DTD) was created. An existing natural language processor (NLP) was modified to generate output consistent with the model. Two hundred reports were processed using the modified NLP system, and the XML output that was generated was validated using an XML validating parser.

Results: The modified NLP system successfully processed all 200 reports. The output of one report was invalid, and 199 reports were valid XML forms consistent with the DTD.

Conclusions: Natural language processing can be used to automatically create an enriched document that contains a structured component whose elements are linked to portions of the original textual report. This integrated document model provides a representation where documents containing specific information can be accurately and efficiently retrieved by querying the structured components. If manual review of the documents is desired, the salient information in the original reports can also be identified and highlighted. Using an XML model of tagging provides an additional benefit in that software tools that manipulate XML documents are readily available.
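
The retrieval idea in the conclusions can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the element names medleeOut, section, structured, tt, problem, certainty, bodyloc, and phr follow the figure captions, while the `v` attribute used to hold a value, the `sent` element name, the report contents, and the helper `reports_with_problem` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature reports. Tag names follow the figure captions;
# the attribute name "v" (value), the "sent" tag, and all report
# contents are assumptions made for this example.
REPORTS = {
    "r1": """<medleeOut><section>
        <structured>
          <problem v="pain" idref="p2"><certainty v="high"/></problem>
          <problem v="swelling" idref="p13"><bodyloc v="knee"/></problem>
        </structured>
        <tt><sent id="s1">Patient has <phr id="p2">pain</phr> and
            <phr id="p13">swelling</phr> of the knee.</sent></tt>
      </section></medleeOut>""",
    "r2": """<medleeOut><section>
        <structured>
          <problem v="cough" idref="p3"/>
        </structured>
        <tt><sent id="s1">Patient reports <phr id="p3">cough</phr>.</sent></tt>
      </section></medleeOut>""",
}


def reports_with_problem(reports, value):
    """Return ids of reports whose structured component has the given problem."""
    hits = []
    for report_id, xml_text in reports.items():
        root = ET.fromstring(xml_text)
        # Query only the structured component, never the free text.
        for problem in root.iterfind("./section/structured/problem"):
            if problem.get("v") == value:
                hits.append(report_id)
                break
    return hits


print(reports_with_problem(REPORTS, "swelling"))
```

Because the query touches only the structured component, a word that merely appears in the narrative text does not produce a match unless the processor encoded it as a finding.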

Figures

Figure 1
An extensible markup language (XML) tagging scheme for an address. The address tag contains embedded tags for street, city, state, and zip code. Similarly, the street tag contains embedded tags for number and street name. This scheme facilitates searching for specific information in certain parts of the address, such as a particular zip code.
Figure 2
The document type definition (DTD) of a clinical report (medleeOut) generated by MedLEE specifies sections that consist of two components: a structured component (structured), which holds the structured data, and a tagged textual component (tt).
Figure 3
An example of the structured component of the output form generated by MedLEE. Two tags correspond to the informational type problem. One has the value pain with a reference to identifier p2 along with other modifiers, which also have values and identifiers. The second has the value swelling and references identifier p13; it also has the modifiers certainty, body location (bodyloc), and sentence identifier (sid), whose value is a triple identifying the section number, the paragraph number in the section, and the sentence number in the paragraph. The identifiers are shown in Figure 4, which illustrates the tagged text component, tt.
Figure 4
The tagged text component is embedded in the tag tt. It contains the original report augmented with tags that delineate sections, sentences, and phrases of a report. Each tag has an id attribute whose value is a unique identifier for that component. For brevity, phr tags are not shown for phrases that are not referenced by the corresponding structured component.
Figure 5
Tagged representation of the structured and tagged components for the sentence "the spleen and liver appear to be moderately enlarged." The values of the id attributes of the tag phr are based on the assumption that the sentence appears at the beginning of the report, so that the first word of the sentence, "the," is assigned position 1. The attribute idref for splenomegaly has two values that reference the individual components "enlarged" and "spleen," which constitute the concept splenomegaly.
Figure 6
Overview of components of MedLEE. There are five processing phases and four knowledge base components. The first phase of processing is the preprocessor. It determines sentence boundaries and performs lexical lookup. The parsing uses the grammar to determine the structure of the sentence and to generate an intermediate form. The regularization phase composes multiword terms, and the encoding phase maps the output to controlled vocabulary terms.
Figure 7
Output showing the description section of a radiologic report associated with the clinical condition congestive heart failure, where terms associated with congestive heart failure are highlighted. The report was retrieved and highlighted using a Java program and structured output generated by MedLEE. The identifiers corresponding to the structured findings associated with the condition were used to highlight the appropriate phrases in the textual report.
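
The linking described in the captions for Figures 5 and 7 can be sketched as follows. This is an illustrative Python sketch, not the Java program mentioned above. The tags structured, tt, problem, and phr and the attributes id and idref come from the captions; the `v` attribute, the `sent` tag, the position-based identifiers p2, p4, and p9, the bracket markers, and the function `highlight` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical section for the Figure 5 sentence. The ids assume the
# sentence starts the report, so "the" is word 1, "spleen" is word 2,
# "liver" is word 4, and "enlarged" is word 9. The "v" attribute and
# the "sent" tag are assumptions.
DOC = """<section>
  <structured>
    <problem v="splenomegaly" idref="p9 p2"/>
  </structured>
  <tt><sent id="s1">the <phr id="p2">spleen</phr> and <phr id="p4">liver</phr> appear to be moderately <phr id="p9">enlarged</phr>.</sent></tt>
</section>"""


def highlight(section_xml, value, mark=("[", "]")):
    """Return the report text with phrases linked to the finding marked."""
    root = ET.fromstring(section_xml)

    # Collect every phrase id referenced by the matching finding;
    # idref may hold several whitespace-separated identifiers.
    wanted = set()
    for problem in root.iterfind("./structured/problem"):
        if problem.get("v") == value:
            wanted.update(problem.get("idref", "").split())

    def render(elem):
        # Rebuild the original text from mixed content (text, children, tails).
        text = elem.text or ""
        for child in elem:
            text += render(child) + (child.tail or "")
        if elem.tag == "phr" and elem.get("id") in wanted:
            text = mark[0] + text + mark[1]
        return text

    return render(root.find("./tt"))


print(highlight(DOC, "splenomegaly"))
# prints: the [spleen] and liver appear to be moderately [enlarged].
```

Splitting idref on whitespace lets a single finding point at several non-adjacent phrases, which is what allows "enlarged" and "spleen" to be marked together for one concept.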
