Nat Protoc. 2019 Dec;14(12):3426-3444.
doi: 10.1038/s41596-019-0227-6. Epub 2019 Nov 20.

High-throughput phenotyping with electronic medical record data using a common semi-supervised approach (PheCAP)

Yichi Zhang et al. Nat Protoc. 2019 Dec.

Abstract

Phenotypes are the foundation for clinical and genetic studies of disease risk and outcomes. The growth of biobanks linked to electronic medical record (EMR) data has both facilitated and increased the demand for efficient, accurate, and robust approaches for phenotyping millions of patients. Challenges to phenotyping with EMR data include variation in the accuracy of codes, as well as the high level of manual input required to identify features for the algorithm and to obtain gold standard labels. To address these challenges, we developed PheCAP, a high-throughput semi-supervised phenotyping pipeline. PheCAP begins with data from the EMR, including structured data and information extracted from the narrative notes using natural language processing (NLP). The standardized steps integrate automated procedures, which reduce the level of manual input, and machine learning approaches for algorithm training. PheCAP itself can be executed in 1-2 days if all data are available; however, the timing depends largely on the chart review stage, which typically requires at least 2 weeks. The final products of PheCAP include a phenotype algorithm, the probability of the phenotype for all patients, and a phenotype classification (yes or no).
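
To make the three final products concrete, the following is a minimal sketch in Python, not the PheCAP implementation. It assumes the phenotype algorithm takes the form of a logistic model over log-transformed feature counts; the feature names, the intercept, the weights, and the 0.5 cut-off are all hypothetical values chosen for illustration, whereas a real algorithm would be trained on chart-reviewed gold standard labels.

```python
import math

# Hypothetical model parameters, for illustration only. A real phenotype
# algorithm would be estimated from chart-reviewed gold standard labels.
INTERCEPT = -3.0
WEIGHTS = {
    "dx_codes": 1.2,      # count of phenotype-related diagnosis codes
    "nlp_mentions": 0.9,  # count of phenotype mentions extracted by NLP
    "notes": -0.4,        # total note count (assumed utilization measure)
}


def phenotype_probability(counts):
    """Return the predicted probability of the phenotype for one patient.

    `counts` maps feature names to raw counts; missing features count as 0.
    Counts are log(1 + x) transformed before weighting (an assumption).
    """
    score = INTERCEPT
    for name, weight in WEIGHTS.items():
        score += weight * math.log1p(counts.get(name, 0))
    return 1.0 / (1.0 + math.exp(-score))


def classify(probability, cutoff=0.5):
    """Turn a probability into a yes/no phenotype classification."""
    return "yes" if probability >= cutoff else "no"


# Made-up patients: one with many phenotype-related records, one with none.
patients = {
    "patient_a": {"dx_codes": 20, "nlp_mentions": 30, "notes": 40},
    "patient_b": {"notes": 5},
}
results = {
    pid: classify(phenotype_probability(counts))
    for pid, counts in patients.items()
}
```

In this sketch the cut-off is fixed arbitrarily; in practice it would be chosen from validation results, for example to reach a target positive predictive value.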

Conflict of interest statement

COMPETING INTERESTS

RMP is employed at Celgene; however, his contributions to the protocol were performed while at Brigham and Women’s Hospital. The remaining authors declare that they have no competing financial and non-financial interests.

Figures

Figure 1. PheCAP Overview.
Starting with all EMR data, a sensitive filter (Procedure step #1) such as a diagnosis code is used to create a data mart (Procedure step #2) containing all patients who may potentially have the phenotype. Codified data, such as diagnosis codes or medication prescriptions related to the phenotype, are extracted from the data mart (Procedure steps #5-6). Additionally, concepts or terms related to the phenotype are extracted using natural language processing (NLP) (Procedure steps #10-15). The NLP dictionary can be developed manually or using an automated process. These data are combined into a patient-level data table (Procedure step #7). In parallel, a random sample of patients is selected for chart review to provide gold standard labels (Procedure steps #3-4). Sparse machine learning is applied in two steps, an unsupervised step (Procedure steps #28-35) and a supervised step (Procedure steps #36-41), to identify the important features of interest. The output of the pipeline is a phenotype algorithm, a probability of the phenotype for all subjects in the data mart, and a classification of the phenotype for each subject (yes/no) (Procedure steps #42-43).
Figure 2. Creating an NLP dictionary.
Automated process to generate an NLP dictionary by processing knowledge sources using NLP (Procedure steps #10-14).
Figure 3. Unsupervised Feature Learning.
Steps to identify informative codified and NLP features for the algorithm prior to supervised training of the algorithm with gold standard labels (Procedure steps #28-35).
Figure 4. Detailed flow of PheCAP protocol.
User input required at various steps in the PheCAP protocol is specified at the top of the figure as the protocol moves from data extraction and data processing, through algorithm training and validation, to the final outputs: a phenotype algorithm, a probability of the phenotype for all subjects in the data mart, and a classification of the phenotype for each subject (yes or no). Numbers in the figure correspond to Procedure steps.
Figure 5.
Clinical terms identified by MetaMap, along with their mapped CUIs, from a Wikipedia article on coronary artery disease (example of results obtained from Procedure step #13).
Figure 6.
Output from parsing the notes in the i2b2 NLP Research Data Set using NILE (example of results obtained from Procedure step #15).
Figure 7.
Output from the supervised algorithm training step depicting (a) SAFE selected features with coefficients from the supervised training step (example of results obtained from Procedure step #38); (b) estimated percent of patients classified as cases (pos.rate), false positive rate (FPR), true positive rate (TPR), positive predictive value (PPV), negative predictive value (NPV), and F-score over a range of cut-off values from validating the algorithm (example of results obtained from Procedure step #41); (c) predicted probability of being a case for patients in the data mart along with their predicted case status, 1=case, 0=non-case (example of results obtained from Procedure steps #42-43).
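
The validation quantities named in the Figure 7 caption can be illustrated with a short Python sketch. This is not the PheCAP code: the function name, the dictionary keys, and the toy probabilities and labels are made up for illustration, and the metrics here are computed directly on a small labeled set rather than estimated as in the protocol. The F-score is taken to be the harmonic mean of PPV and TPR.

```python
def metrics_at_cutoff(probabilities, labels, cutoff):
    """Compute validation metrics for one probability cut-off.

    `labels` holds gold standard labels (1 = case, 0 = non-case); a patient
    is predicted to be a case when the probability is >= cutoff.
    """
    tp = fp = tn = fn = 0
    for prob, label in zip(probabilities, labels):
        predicted_case = prob >= cutoff
        if predicted_case and label == 1:
            tp += 1
        elif predicted_case and label == 0:
            fp += 1
        elif not predicted_case and label == 1:
            fn += 1
        else:
            tn += 1

    def ratio(numerator, denominator):
        # Undefined ratios (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "pos_rate": ratio(tp + fp, tp + fp + tn + fn),
        "fpr": ratio(fp, fp + tn),
        "tpr": ratio(tp, tp + fn),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "f_score": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Toy data, invented for illustration only.
toy_probabilities = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
toy_labels = [1, 1, 0, 1, 0, 0]

# One row of metrics per cut-off, analogous to a table over cut-off values.
table = {
    cutoff: metrics_at_cutoff(toy_probabilities, toy_labels, cutoff)
    for cutoff in (0.25, 0.5, 0.75)
}
```

Scanning such a table is one way to pick a cut-off, for instance the lowest value whose FPR stays under a chosen limit.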
