J Am Med Inform Assoc. 2013 Dec;20(e2):e355-64.
doi: 10.1136/amiajnl-2013-001946. Epub 2013 Oct 29.

Validating a strategy for psychosocial phenotyping using a large corpus of clinical text

Adi V Gundlapalli et al. J Am Med Inform Assoc. 2013 Dec.

Abstract

Objective: To develop algorithms to improve efficiency of patient phenotyping using natural language processing (NLP) on text data. Of a large number of note titles available in our database, we sought to determine those with highest yield and precision for psychosocial concepts.

Materials and methods: From a database of over 1 billion documents from US Department of Veterans Affairs medical facilities, a random sample of 1500 documents was drawn from each of 218 enterprise note titles. Psychosocial concepts were extracted using a UIMA-AS-based NLP pipeline (v3NLP) that combined a lexicon of relevant concepts with negation and template format annotators. Human reviewers evaluated a subset of documents for false positives and sensitivity. High-yield documents were identified by hit rate and precision. Reasons for false positivity were characterized.
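The per-title sampling step can be illustrated with a minimal sketch. It is not the authors' code: the function name `sample_per_title`, the in-memory dictionary of document IDs, the fixed seed, and the rule that a title with fewer than 1500 documents contributes all of them are assumptions made for illustration only.

```python
import random


def sample_per_title(doc_ids_by_title, per_title=1500, seed=0):
    """Draw a simple random sample of document IDs for each note title.

    Assumption: titles with fewer than `per_title` documents contribute
    every document they have. The abstract does not state this rule.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    sampled = {}
    for title in sorted(doc_ids_by_title):  # stable order across runs
        ids = doc_ids_by_title[title]
        k = min(per_title, len(ids))
        sampled[title] = rng.sample(ids, k)  # sampling without replacement
    return sampled


# Hypothetical toy corpus: two made-up note titles with made-up document IDs.
corpus = {
    "TITLE A": list(range(5000)),
    "TITLE B": list(range(5000, 5200)),
}
sample = sample_per_title(corpus)
print({title: len(ids) for title, ids in sample.items()})
# {'TITLE A': 1500, 'TITLE B': 200}
```

Sampling a fixed number of documents per title, rather than sampling the corpus as a whole, keeps rare note titles represented so that their hit rates can be compared with those of common titles.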

Results: A total of 58 707 psychosocial concepts were identified from 316 355 documents, for an overall hit rate of 0.2 concepts per document (median 0.1, range 0 to 1.6). Of 6031 concepts reviewed from a high-yield set of note titles, the overall precision for all concept categories was 80%, with variability among note titles and concept categories. Reasons for false positivity included templating, negation, context, and alternate meaning of words. The sensitivity of the NLP system was 49% (95% CI 43% to 55%).
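The two headline figures can be checked with a few lines of arithmetic. The sketch below assumes that hit rate means concepts per document and that precision means the share of reviewed concepts not judged false positive; the count of 1223 false positives is taken from the Figure 4 caption, and the function names are illustrative.

```python
def hit_rate(n_concepts, n_documents):
    """Concepts extracted per document."""
    return n_concepts / n_documents


def precision(n_reviewed, n_false_positive):
    """Share of reviewed concepts that were not false positives."""
    return (n_reviewed - n_false_positive) / n_reviewed


# Counts reported in the abstract and in the Figure 4 caption.
overall_hit_rate = hit_rate(58_707, 316_355)
reviewed_precision = precision(6_031, 1_223)

print(f"{overall_hit_rate:.2f}")    # 0.19
print(f"{reviewed_precision:.1%}")  # 79.7%
```

Both values round to the reported 0.2 concepts per document and 80% precision.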

Conclusions: Phenotyping using NLP need not involve the entire document corpus. Our methods offer a generalizable strategy for scaling NLP pipelines to large free text corpora with complex linguistic annotations in attempts to identify patients of a certain phenotype.

Keywords: clinical informatics; high-throughput; natural language processing; patient phenotype; psychosocial concepts.

Figures

Figure 1
Relative proportions of 220 US Department of Veterans Affairs (VA) enterprise note titles in document corpus.

Figure 2
Histogram of rank of the concept hit rate of psychosocial concepts extracted by the natural language processing (NLP) pipeline per note title of 218 note titles from the US Department of Veterans Affairs (VA) database. Concepts from the dark shaded note titles were evaluated by human reviewers for the false positive analysis as described in the methods.

Figure 3
Relative prevalence of psychosocial concepts extracted by natural language processing in note titles with highest concept hit rate (top 25) compared with the other 193 note titles in the document corpus.

Figure 4
Bar graph representation of reasons for false positivity of 1223 concepts as determined by human review of a total of 6031 concepts extracted by natural language processing from 35 note titles; the note titles are ordered by their hit rate (highest to lowest).

Figure 5
Examples of phrases and terms leading to false positives of concepts identified by natural language processing of US Department of Veterans Affairs (VA) text documents.