J Am Med Inform Assoc. 2012 Sep-Oct;19(5):824-32.
doi: 10.1136/amiajnl-2011-000776. Epub 2012 May 14.

Feature engineering combined with machine learning and rule-based methods for structured information extraction from narrative clinical discharge summaries

Yan Xu et al. J Am Med Inform Assoc. 2012 Sep-Oct.

Abstract

Objective: Systems that translate narrative text in the medical domain into a structured representation are in great demand. The system described here performs three sub-tasks: concept extraction, assertion classification, and relation identification.

Design: The overall system consists of five steps: (1) pre-processing sentences, (2) marking noun phrases (NPs) and adjective phrases (APs), (3) extracting concepts, using a dosage-unit dictionary to switch dynamically between two models based on Conditional Random Fields (CRF), (4) classifying assertions by a vote among five classifiers, and (5) identifying relations using normalized sentences with a set of effective discriminating features.
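Steps (3) and (4) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the abstract does not say how the dosage-unit dictionary triggers the switch or how votes are counted, so the dictionary contents, the "any token matches" rule, the model names, and the plurality vote with a first-seen tie-break are all hypothetical.

```python
from collections import Counter
from typing import List

# Hypothetical dosage-unit dictionary; the paper's actual dictionary is not given here.
DOSAGE_UNITS = {"mg", "ml", "mcg", "tablet", "tablets", "units"}


def choose_model(sentence: str) -> str:
    """Pick which of two models to apply (assumed rule, not the paper's).

    Returns "telegraphic" when any token is a known dosage unit,
    otherwise "general".
    """
    tokens = sentence.lower().replace(",", " ").split()
    return "telegraphic" if any(t in DOSAGE_UNITS for t in tokens) else "general"


def majority_vote(labels: List[str]) -> str:
    """Return the most frequent label; ties go to the earliest classifier."""
    if not labels:
        raise ValueError("at least one classifier output is required")
    counts = Counter(labels)
    best = max(counts.values())
    for label in labels:  # iterate in classifier order for a stable tie-break
        if counts[label] == best:
            return label
    raise AssertionError("unreachable")
```

A call such as `majority_vote(["present", "possible", "present", "absent", "present"])` would combine the outputs of five assertion classifiers into one label.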

Measurements: Macro-averaged and micro-averaged precision, recall and F-measure were used to evaluate results.
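The difference between the two averages can be sketched in Python. The per-class counts used in the usage note are invented for illustration and are not results from the paper. Macro-averaged F-measure is computed here as the mean of per-class F-measures, which is one common convention; the abstract does not state which convention the authors used.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class Counts:
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall and balanced F-measure from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


def micro_average(per_class: Dict[str, Counts]) -> Tuple[float, float, float]:
    """Pool counts over all classes, then compute the metrics once."""
    tp = sum(c.tp for c in per_class.values())
    fp = sum(c.fp for c in per_class.values())
    fn = sum(c.fn for c in per_class.values())
    return prf(tp, fp, fn)


def macro_average(per_class: Dict[str, Counts]) -> Tuple[float, float, float]:
    """Compute the metrics per class, then take the unweighted mean of each."""
    if not per_class:
        raise ValueError("at least one class is required")
    scores = [prf(c.tp, c.fp, c.fn) for c in per_class.values()]
    n = len(scores)
    return (
        sum(s[0] for s in scores) / n,
        sum(s[1] for s in scores) / n,
        sum(s[2] for s in scores) / n,
    )
```

With hypothetical counts such as `{"problem": Counts(80, 20, 20), "treatment": Counts(30, 10, 30), "test": Counts(10, 0, 10)}`, the micro average is dominated by the largest class, while the macro average weights all three classes equally.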

Results: Performance is competitive with state-of-the-art systems, with micro-averaged F-measures of 0.8489 for concept extraction, 0.9392 for assertion classification and 0.7326 for relation identification.

Conclusions: The system exploits an array of common features and achieves state-of-the-art performance. Prudent feature engineering sets the foundation of our systems. In concept extraction, we demonstrated that switching between models, one of which is designed specifically for telegraphic sentences, significantly improved extraction of the treatment concept. In assertion classification, a set of features derived from a rule-based classifier proved effective for classes such as conditional and possible, which would otherwise suffer from data scarcity in conventional machine-learning methods. In relation identification, we use a two-stage architecture, the second stage of which applies pairwise classifiers to the possible candidate classes. This architecture significantly improves performance.
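One way to read the two-stage idea for relation identification is sketched below in Python. This is not the authors' implementation: the abstract does not describe how candidate classes are chosen, so taking the two highest-scoring first-stage classes is an assumption, and the function name, label names, and feature name are hypothetical.

```python
from typing import Callable, Dict, FrozenSet

Features = Dict[str, float]
PairwiseClassifier = Callable[[Features], str]


def two_stage_predict(
    features: Features,
    stage_one_scores: Dict[str, float],
    pairwise: Dict[FrozenSet[str], PairwiseClassifier],
) -> str:
    """Stage one proposes candidates; stage two decides between the top two.

    Assumption: the candidates are the two highest-scoring stage-one classes.
    Falls back to the stage-one winner when no pairwise classifier exists
    for that pair.
    """
    if not stage_one_scores:
        raise ValueError("stage one must score at least one class")
    ranked = sorted(stage_one_scores, key=stage_one_scores.get, reverse=True)
    if len(ranked) < 2:
        return ranked[0]
    classifier = pairwise.get(frozenset(ranked[:2]))
    if classifier is None:
        return ranked[0]
    return classifier(features)
```

In this sketch a dedicated pairwise classifier can overturn the first-stage winner when the two leading classes are easily confused, which is the kind of case where a specialised binary decision may help.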

Conflict of interest statement

Competing interests: None.

Figures

Figure 1. A flow diagram of the overall system.
Figure 2. A detailed overall flow chart of the system. CRF, conditional random field; POS, part of speech; SVM, support vector machine.
Figure 3. The results from the top 10 systems and our system for the concept task. On the abscissa, the numbers 1–10 denote the top 10 systems and the number 11 denotes our improved system.