J Am Med Inform Assoc. 2015 Sep;22(5):980-6.

doi: 10.1093/jamia/ocv012. Epub 2015 May 14.

Automating the generation of lexical patterns for processing free text in clinical documents

Frank Meng¹, Craig Morioka²

Affiliations

¹ Medical Imaging Informatics Group, Department of Radiological Sciences, University of California, Los Angeles, CA, USA; MAVERIC, VA Boston Healthcare System, Boston, MA, USA; fmeng@mii.ucla.edu
² Medical Imaging Informatics Group, Department of Radiological Sciences, University of California, Los Angeles, CA, USA; Department of Radiology, VA Greater Los Angeles Healthcare System, Los Angeles, CA, USA

PMID: 25977405
PMCID: PMC4986670
DOI: 10.1093/jamia/ocv012

Frank Meng et al. J Am Med Inform Assoc. 2015 Sep.

Abstract

Objective: Many tasks in natural language processing utilize lexical pattern-matching techniques, including information extraction (IE), negation identification, and syntactic parsing. However, it is generally difficult to derive patterns that achieve acceptable levels of recall while also remaining highly precise.

Materials and methods: We present a multiple sequence alignment (MSA)-based technique that automatically generates patterns, leveraging language usage to determine which context words influence a given target. MSAs capture the commonalities among word sequences and reveal areas of linguistic stability and variation. In this way, MSAs provide a systematic approach to generating generalizable lexical patterns, which is intended to increase recall while maintaining high precision.

Results: The MSA-generated patterns achieved consistent F1, F0.5, and F2 scores across four different IE tasks when compared with two baseline techniques. Both baselines performed well on some tasks and less well on others, whereas MSA performed at a consistently high level on all four tasks.
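
For reference, the three scores named above are instances of the standard F-beta measure. The formula below is the general textbook definition, with P denoting precision and R denoting recall; it is not taken from the paper's results, and the symbols are chosen here for illustration.

```latex
\[
F_\beta = (1 + \beta^2)\,\frac{P \cdot R}{\beta^2 P + R}
\]
```

Setting β = 1 weights precision and recall equally (F1), β = 0.5 weights precision more heavily (F0.5), and β = 2 weights recall more heavily (F2).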

Discussion: The performance of MSA on the four extraction tasks indicates the method's versatility. The results show that the MSA-based patterns are able to handle the extraction of individual data elements as well as relations between two concepts without the need for large amounts of manual intervention.

Conclusion: We presented an MSA-based framework for generating lexical patterns that showed consistently high levels of both precision and recall over four different extraction tasks when compared to baseline methods.

Keywords: information extraction; natural language processing; text mining.

Figures

**Figure 1:**
Determining the words that make up the context of a target is a nontrivial task. Using distance does not favor one context over the other for the example target.

**Figure 2:**
Pair-wise alignment of two sentences, where the light gray shaded boxes represent matched tokens (base tokens) while the dark gray shaded boxes represent tokens that have been inserted. White boxes indicate tokens that do not participate in the alignment.
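
A pair-wise alignment of this kind can be sketched as follows. This is a minimal illustration that uses a longest-common-subsequence alignment as a stand-in; the paper's actual alignment and scoring scheme is not given in this abstract, and the function name `align` and the example sentences are hypothetical.

```python
from typing import List, Optional, Tuple

Pair = Tuple[Optional[str], Optional[str]]

def align(a: List[str], b: List[str]) -> List[Pair]:
    """Align two token lists via longest common subsequence (LCS).

    Assumption: LCS stands in for the paper's alignment method.
    Matched tokens appear as (tok, tok); tokens present in only one
    sentence appear as (tok, None) or (None, tok).
    """
    n, m = len(a), len(b)
    # lcs[i][j] = length of the LCS of a[i:] and b[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])

    # Walk the table from the start to recover one optimal alignment.
    pairs: List[Pair] = []
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            pairs.append((a[i], b[j]))
            i += 1
            j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            pairs.append((a[i], None))
            i += 1
        else:
            pairs.append((None, b[j]))
            j += 1
    pairs.extend((tok, None) for tok in a[i:])
    pairs.extend((None, tok) for tok in b[j:])
    return pairs

# Illustrative sentences (not from the paper).
s1 = "a 5 mm nodule in the left lung".split()
s2 = "a nodule in the right upper lobe".split()
pairs = align(s1, s2)
matched = [x for x, y in pairs if x is not None and y is not None]
```

Here `matched` holds the tokens shared by both sentences in order, which play the role of the matched (base) tokens in the figure; the remaining pairs mark tokens that appear in only one sentence.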

**Figure 3:**
An example MSA generated from the common tokens of several different sentences that share the same target. The MSA clearly shows the areas of stability and variation among the different sentences.

**Figure 4:**
Matching a pattern with a sentence is a pair-wise alignment with the inclusion of wildcards that can be matched against any token. In addition, to be considered a successful match, all tokens within the pattern must correspond with a token in the sentence.
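
The matching rule in this caption can be sketched roughly as below. The sketch assumes that a wildcard is written `*` and consumes exactly one token, and that sentence tokens outside the pattern may be skipped; neither detail is confirmed by the abstract, and `match_pattern` is a hypothetical name.

```python
from typing import List, Optional

WILDCARD = "*"  # assumed wildcard symbol, matches any single token

def match_pattern(pattern: List[str], sentence: List[str]) -> Optional[List[int]]:
    """Return the sentence index matched by each pattern token, or None.

    Every pattern token must correspond to a sentence token, in order.
    Each pattern token is greedily matched to the earliest eligible
    sentence token, which is sufficient to decide whether an in-order
    match exists.
    """
    positions: List[int] = []
    start = 0
    for p in pattern:
        k = start
        while k < len(sentence) and not (p == WILDCARD or p == sentence[k]):
            k += 1
        if k == len(sentence):
            return None  # a pattern token has no counterpart: no match
        positions.append(k)
        start = k + 1
    return positions

# Illustrative sentence and patterns (not from the paper).
sentence = "a 5 mm nodule in the left lung".split()
hit = match_pattern(["nodule", "in", "the", "*", "lung"], sentence)
miss = match_pattern(["nodule", "*", "lobe"], sentence)
```

In this example `hit` is a list of token positions because every pattern token finds a counterpart, while `miss` is `None` because `lobe` never occurs in the sentence.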

**Figure 5:**
System workflow. Sentences containing targets are first identified from documents and a subsample is set aside for both generating and scoring patterns. Scoring is based on a manually annotated set of instances. The scored patterns are then matched against sentences containing yet-to-be-extracted targets.

**Figure 6:**
Extraction tasks. Task 1 extracts the relation between nodule types and anatomy locations, Task 2 extracts nodule sizes, Task 3 extracts the relation between sizes and anatomy locations, and Task 4 extracts the relation between sizes and dates. Some markings have been left out, to preserve clarity.

Cited by

Named Entity Recognition in Prehospital Trauma Care.
Silverman GM, Lindemann EA, Rajamani G, Finzel RL, McEwan R, Knoll BC, Pakhomov S, Melton GB, Tignanelli CJ. Stud Health Technol Inform. 2019 Aug 21;264:1586-1587. doi: 10.3233/SHTI190547. PMID: 31438244. Free PMC article.
Clinical Natural Language Processing in 2015: Leveraging the Variety of Texts of Clinical Interest.
Névéol A, Zweigenbaum P. Yearb Med Inform. 2016 Nov 10;(1):234-239. doi: 10.15265/IY-2016-049. PMID: 27830256. Free PMC article.
Enhanced Quality Measurement Event Detection: An Application to Physician Reporting.
Tamang SR, Hernandez-Boussard T, Ross EG, Gaskin G, Patel MI, Shah NH. EGEMS (Wash DC). 2017 May 30;5(1):5. doi: 10.13063/2327-9214.1270. PMID: 29881731. Free PMC article.
