Nat Protoc. 2019 Dec;14(12):3426-3444.

doi: 10.1038/s41596-019-0227-6. Epub 2019 Nov 20.

High-throughput phenotyping with electronic medical record data using a common semi-supervised approach (PheCAP)

Yichi Zhang^#¹, Tianrun Cai^#², Sheng Yu^#^{3,4}, Kelly Cho^{5,6}, Chuan Hong¹, Jiehuan Sun¹, Jie Huang², Yuk-Lam Ho⁵, Ashwin N Ananthakrishnan⁷, Zongqi Xia⁸, Stanley Y Shaw⁹, Vivian Gainer¹⁰, Victor Castro¹⁰, Nicholas Link⁵, Jacqueline Honerlaw⁵, Sicong Huang², David Gagnon^{5,11}, Elizabeth W Karlson², Robert M Plenge², Peter Szolovits¹², Guergana Savova¹³, Susanne Churchill¹⁴, Christopher O'Donnell^{5,15}, Shawn N Murphy^{10,14,16}, J Michael Gaziano^{5,6}, Isaac Kohane¹⁴, Tianxi Cai^{1,14}, Katherine P Liao^{17,18,19}

Affiliations

¹ Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA.
² Division of Rheumatology, Immunology, and Allergy, Brigham and Women's Hospital, Boston, MA, USA.
³ Center for Statistical Science, Tsinghua University, Beijing, China.
⁴ Department of Industrial Engineering, Tsinghua University, Beijing, China.
⁵ Division of Data Sciences, VA Boston Healthcare System, Boston, MA, USA.
⁶ Division of Aging, Brigham and Women's Hospital, Boston, MA, USA.
⁷ Department of Gastroenterology, Massachusetts General Hospital, Boston, MA, USA.
⁸ Department of Neurology, University of Pittsburgh, Pittsburgh, PA, USA.
⁹ Division of Cardiovascular Medicine, Brigham and Women's Hospital, Boston, MA, USA.
¹⁰ Research Information Science and Computing, Partners Healthcare, Boston, MA, USA.
¹¹ Department of Biostatistics, Boston University, Boston, MA, USA.
¹² Department of Electrical Engineering and Computer Science, MIT, Cambridge, MA, USA.
¹³ Computational Health Informatics Program, Boston Children's Hospital, Boston, MA, USA.
¹⁴ Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
¹⁵ Division of Cardiology, VA Boston Healthcare System, Boston, MA, USA.
¹⁶ Department of Neurology, Massachusetts General Hospital, Boston, MA, USA.
¹⁷ Division of Rheumatology, Immunology, and Allergy, Brigham and Women's Hospital, Boston, MA, USA. kliao@bwh.harvard.edu.
¹⁸ Division of Data Sciences, VA Boston Healthcare System, Boston, MA, USA. kliao@bwh.harvard.edu.
¹⁹ Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA. kliao@bwh.harvard.edu.

^# Contributed equally.

PMID: 31748751
PMCID: PMC7323894
DOI: 10.1038/s41596-019-0227-6

Yichi Zhang et al. Nat Protoc. 2019 Dec.

Abstract

Phenotypes are the foundation for clinical and genetic studies of disease risk and outcomes. The growth of biobanks linked to electronic medical record (EMR) data has both facilitated and increased the demand for efficient, accurate, and robust approaches for phenotyping millions of patients. Challenges to phenotyping with EMR data include variation in the accuracy of codes, as well as the high level of manual input required to identify features for the algorithm and to obtain gold standard labels. To address these challenges, we developed PheCAP, a high-throughput semi-supervised phenotyping pipeline. PheCAP begins with data from the EMR, including structured data and information extracted from the narrative notes using natural language processing (NLP). The standardized steps integrate automated procedures, which reduce the level of manual input, and machine learning approaches for algorithm training. PheCAP itself can be executed in 1-2 d if all data are available; however, the timing is largely dependent on the chart review stage, which typically requires at least 2 weeks. The final products of PheCAP include a phenotype algorithm, the probability of the phenotype for all patients, and a phenotype classification (yes or no).

Conflict of interest statement

COMPETING INTERESTS

RMP is employed at Celgene; however, his contributions to the protocol were performed while at Brigham and Women’s Hospital. The remaining authors declare that they have no competing financial or non-financial interests.

Figures

**Figure 1. PheCAP Overview.**
Starting with all EMR data, a sensitive filter (**Procedure step #1**), such as a diagnosis code, is used to create a data mart (**Procedure step #2**) containing all patients who may potentially have the phenotype. Codified data, such as diagnosis codes or medication prescriptions related to the phenotype, are extracted from the data mart (**Procedure steps #5-6**). Additionally, concepts or terms related to the phenotype are extracted using natural language processing (NLP) (**Procedure steps #10-15**). The NLP dictionary can be developed manually or using an automated process. These data are combined into a patient-level data table (**Procedure step #7**). In parallel, a random sample of patients is selected for chart review to provide gold standard labels (**Procedure steps #3-4**). Sparse machine learning is applied in two steps, an unsupervised step (**Procedure steps #28-35**) and a supervised step (**Procedure steps #36-41**), to identify the important features of interest. The output of the pipeline is a phenotype algorithm, a probability of the phenotype for all subjects in the data mart, and a classification of the phenotype for each subject (yes/no) (**Procedure steps #42-43**).
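
The first stages described in this caption (sensitive filter, data mart, patient-level table) can be illustrated with a minimal Python sketch. This is not the PheCAP implementation: the record layout, the feature names (`code:A`, `nlp:X`, and so on), and both helper functions are hypothetical and chosen only to show the data flow.

```python
from collections import Counter, defaultdict

# Hypothetical EMR rows: (patient_id, feature). Feature names are placeholders,
# not real diagnosis codes or NLP concept identifiers.
RECORDS = [
    ("p1", "code:A"),
    ("p1", "code:A"),
    ("p1", "nlp:X"),
    ("p2", "code:B"),
    ("p3", "nlp:X"),
    ("p3", "code:A"),
]

# Assumed sensitive filter: any patient with at least one occurrence of code:A.
FILTER_FEATURES = {"code:A"}


def build_data_mart(records, filter_features):
    """Return the IDs of patients who pass the sensitive filter."""
    return {pid for pid, feature in records if feature in filter_features}


def patient_level_table(records, data_mart):
    """Count every codified and NLP feature per patient in the data mart."""
    table = defaultdict(Counter)
    for pid, feature in records:
        if pid in data_mart:
            table[pid][feature] += 1
    return {pid: dict(counts) for pid, counts in table.items()}


data_mart = build_data_mart(RECORDS, FILTER_FEATURES)
table = patient_level_table(RECORDS, data_mart)
```

Here patient `p2` never has the filter feature and is therefore excluded from the data mart, while `p1` and `p3` each get one row of feature counts that later steps could use for training.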

**Figure 2. Creating an NLP dictionary.**
Automated process to generate an NLP dictionary by processing knowledge sources using NLP (**Procedure steps #10-14**).

**Figure 3. Unsupervised Feature Learning.**
Steps to identify informative codified and NLP features for the algorithm prior to supervised training of the algorithm with gold standard labels (**Procedure steps #28-35**).

**Figure 4. Detailed flow of PheCAP protocol.**
The user input required at various steps in the PheCAP protocol is specified at the top of the figure as the protocol moves from data extraction and data processing, through algorithm training and validation, to the final outputs: a phenotype algorithm, a probability of the phenotype for all subjects in the data mart, and a classification of the phenotype for each subject (yes or no). Numbers in the figure correspond to **Procedure steps**.

**Figure 5.**
Clinical terms identified by MetaMAP along with their mapped CUIs from a Wikipedia article on coronary artery disease **(example of results obtained from Procedure step #13)**.

**Figure 6.**
Output from parsing the notes in the i2b2 NLP Research Data Set using NILE **(example of results obtained from Procedure step #15)**.

**Figure 7.**
Output from the supervised algorithm training step depicting (a) SAFE-selected features with coefficients from the supervised training step **(example of results obtained from Procedure step #38)**; (b) estimated percentage of patients classified as cases (pos.rate), false positive rate (FPR), true positive rate (TPR), positive predictive value (PPV), negative predictive value (NPV), and F-score over a range of cut-off values from validating the algorithm **(example of results obtained from Procedure step #41)**; (c) predicted probability of being a case for patients in the data mart along with their predicted case status, 1 = case, 0 = non-case **(example of results obtained from Procedure steps #42-43)**.
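
The quantities named in panel (b) can be computed from predicted probabilities and gold standard labels, as in the minimal sketch below. It is a plain empirical calculation on a toy labeled set, not the estimator used by PheCAP. The function name, the example data, and the reading of "F-score" as the F1 score (harmonic mean of PPV and TPR) are assumptions.

```python
def metrics_at_cutoff(probs, labels, cutoff):
    """Classify as case when prob >= cutoff and summarize against labels (1/0)."""
    preds = [p >= cutoff for p in probs]
    tp = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat)
    fp = sum(1 for y, yhat in zip(labels, preds) if y == 0 and yhat)
    fn = sum(1 for y, yhat in zip(labels, preds) if y == 1 and not yhat)
    tn = sum(1 for y, yhat in zip(labels, preds) if y == 0 and not yhat)

    def ratio(num, den):
        # Undefined ratios (zero denominator) are reported as NaN.
        return num / den if den else float("nan")

    ppv = ratio(tp, tp + fp)
    tpr = ratio(tp, tp + fn)
    return {
        "pos.rate": ratio(tp + fp, len(labels)),
        "FPR": ratio(fp, fp + tn),
        "TPR": tpr,
        "PPV": ppv,
        "NPV": ratio(tn, tn + fn),
        "F-score": ratio(2 * ppv * tpr, ppv + tpr),  # assumed to be F1
    }


# Toy validation set: predicted probabilities and chart-review labels.
probs = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

# One row of metrics per cut-off, mirroring the layout described in panel (b).
by_cutoff = {c: metrics_at_cutoff(probs, labels, c) for c in (0.25, 0.5, 0.75)}
```

Raising the cut-off trades sensitivity (TPR) for PPV, which is why the metrics are tabulated over a range of cut-off values before a threshold for the yes/no classification in panel (c) is chosen.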

Grants and funding

U54 LM008748/LM/NLM NIH HHS/United States
