J Am Med Inform Assoc. 2016 Nov;23(6):1166-1173.

doi: 10.1093/jamia/ocw028. Epub 2016 May 12.

Learning statistical models of phenotypes using noisy labeled training data

Vibhu Agarwal¹, Tanya Podchiyska², Juan M Banda³, Veena Goel⁴,⁵, Tiffany I Leung⁶, Evan P Minty²,⁷, Timothy E Sweeney²,⁸, Elsie Gyang⁹, Nigam H Shah³

Affiliations

¹ Biomedical Informatics Training Program, Stanford University, Stanford CA 94305-5479, USA vibhua@stanford.edu.
² Biomedical Informatics Training Program, Stanford University, Stanford CA 94305-5479, USA.
³ Stanford Center for Biomedical Informatics Research, Stanford University, Stanford CA 94305-5479, USA.
⁴ Department of Pediatrics, Stanford University School of Medicine, Stanford CA 94305-5208, USA.
⁵ Department of Clinical Informatics, Stanford Children's Health, Stanford CA 94305-5474, USA.
⁶ Division of General Medical Disciplines, Stanford University, Stanford CA 94305, USA.
⁷ Faculty of Medicine, University of Calgary, Calgary Alberta, T2N 4N1, Canada.
⁸ Department of Surgery, Stanford Hospital & Clinics, Stanford CA 94305-2200, USA.
⁹ Division of Vascular Surgery, Stanford Hospital & Clinics, Stanford CA 94305-5642, USA.

PMID: 27174893
PMCID: PMC5070523
DOI: 10.1093/jamia/ocw028

Vibhu Agarwal et al. J Am Med Inform Assoc. 2016 Nov.

Abstract

Objective: Traditionally, patient groups with a phenotype are selected through rule-based definitions whose creation and validation are time-consuming. Machine learning approaches to electronic phenotyping are limited by the paucity of labeled training datasets. We demonstrate the feasibility of utilizing semi-automatically labeled training sets to create phenotype models via machine learning, using a comprehensive representation of the patient medical record.

Methods: We use a list of keywords specific to the phenotype of interest to generate noisy labeled training data. We train L1 penalized logistic regression models for a chronic and an acute disease and evaluate the performance of the models against a gold standard.
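The two steps in the Methods can be sketched as follows. This is a minimal illustration, not the authors' implementation: the keyword list, the toy notes, the three binary features, and the hyperparameters (`lam`, `lr`, `epochs`) are all hypothetical, keyword matching ignores negation, and the L1 penalized logistic regression is fitted with a plain proximal gradient loop chosen here only to keep the example dependency-free.

```python
import math
import re

def noisy_label(note, keywords):
    """Return 1 if any phenotype keyword is mentioned in the note, else 0."""
    text = note.lower()
    return int(any(re.search(r"\b" + re.escape(k.lower()) + r"\b", text)
                   for k in keywords))

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_threshold(w, t):
    """Proximal operator of the L1 penalty: shrink w toward zero by t."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0

def fit_l1_logistic(X, y, lam=0.05, lr=0.5, epochs=500):
    """Proximal gradient descent on mean log-loss + lam * ||w||_1.

    The intercept b is not penalized.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        b -= lr * grad_b / n
        w = [soft_threshold(wj - lr * gj / n, lr * lam)
             for wj, gj in zip(w, grad_w)]
    return w, b

def predict(w, b, x):
    """Classify a feature vector at a 0.5 probability threshold."""
    return int(sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) >= 0.5)

# Hypothetical keyword list and toy records: (note text, feature vector).
# Assumed features: [on metformin, HbA1c ordered, unrelated code].
keywords = ["type 2 diabetes", "t2dm"]
records = [
    ("Patient with type 2 diabetes, on metformin.", [1, 1, 0]),
    ("T2DM follow-up.", [1, 1, 1]),
    ("Known type 2 diabetes.", [1, 0, 0]),
    ("type 2 diabetes mellitus, stable", [0, 1, 1]),
    ("Ankle sprain.", [0, 0, 0]),
    ("Routine visit, no complaints.", [0, 0, 1]),
    ("Seasonal allergies.", [0, 0, 0]),
    ("Fractured wrist.", [0, 0, 1]),
]

labels = [noisy_label(note, keywords) for note, _ in records]
X = [features for _, features in records]
w, b = fit_l1_logistic(X, labels)
print("noisy labels:", labels)
print("weights:", [round(wj, 3) for wj in w], "intercept:", round(b, 3))
```

The labels come only from keyword mentions, so they can be wrong for individual patients; the model is trained on them anyway and evaluated separately against a manually reviewed gold standard.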

Results: Our models for Type 2 diabetes mellitus and myocardial infarction achieve precision and accuracy of 0.90 and 0.89, and 0.86 and 0.89, respectively. Local implementations of the previously validated rule-based definitions for Type 2 diabetes mellitus and myocardial infarction achieve precision and accuracy of 0.96 and 0.92, and 0.84 and 0.87, respectively. We have demonstrated the feasibility of learning phenotype models using imperfectly labeled data for a chronic and an acute phenotype. Further research in feature engineering and in specification of the keyword list can improve the performance of the models and the scalability of the approach.
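For reference, the two reported metrics can be computed from gold-standard and predicted labels as in the sketch below. The label vectors are made up for illustration and do not reproduce the study's data.

```python
def precision_and_accuracy(gold, predicted):
    """Return (precision, accuracy) for binary labels encoded as 0/1."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    tn = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 0)
    # Precision: fraction of predicted positives that are true positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Accuracy: fraction of all records classified correctly.
    accuracy = (tp + tn) / len(gold)
    return precision, accuracy

# Hypothetical gold-standard labels and model predictions.
gold = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 0, 1, 1, 0]
precision, accuracy = precision_and_accuracy(gold, predicted)
print(precision, accuracy)  # 0.75 0.75
```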

Conclusions: Our method provides an alternative to manual labeling for creating training sets for statistical models of phenotypes. Such an approach can accelerate research with large observational healthcare datasets and may also be used to create local phenotype models.

Keywords: Electronic health record; high throughput; machine learning; noisy labels; phenotyping.

Figures

**Figure 1:**
Evaluating the performance of statistical models learned from semi-automatically labeled data with noisy labels. (A) Existing rule-based phenotype definitions for the phenotypes are implemented using SQL. (B) Using a list of phenotype-specific keywords, patient records are labeled as having or not having the phenotype, thus creating a noisy labeled training dataset. Features are constructed based on terms in notes, diagnostic codes, prescriptions, and lab orders. Keywords used in the noisy labeling are excluded. The data matrix is split into training and test sets for training a statistical model and for carrying out 5-fold cross-validation. (C) A manually reviewed gold standard set of patient records is created (excluding those used for training the model) and is used to evaluate both the rule-based definition and the statistical model for each phenotype.
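Step (B) of the caption can be illustrated with a small sketch: build a binary feature matrix while dropping the labeling keywords, then generate 5-fold train/test splits. The term lists, the keyword, and the helper names `build_features` and `k_fold_indices` are hypothetical; the caption does not say how the original pipeline encodes features or draws folds.

```python
import random

def build_features(record_terms, keywords):
    """Binary term-presence matrix that excludes the labeling keywords."""
    excluded = {k.lower() for k in keywords}
    vocab = sorted({t.lower() for terms in record_terms for t in terms} - excluded)
    index = {term: i for i, term in enumerate(vocab)}
    matrix = []
    for terms in record_terms:
        row = [0] * len(vocab)
        for t in terms:
            j = index.get(t.lower())
            if j is not None:  # excluded keywords have no column
                row[j] = 1
        matrix.append(row)
    return vocab, matrix

def k_fold_indices(n, k=5, seed=0):
    """Yield (train, test) index lists; each record is tested exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]

# Hypothetical terms per patient record (notes, codes, prescriptions, labs).
record_terms = [
    ["myocardial infarction", "troponin", "aspirin"],
    ["troponin", "ICD9:410"],
    ["aspirin", "ankle sprain"],
    ["ICD9:410", "myocardial infarction"],
    ["ankle sprain"],
]
keywords = ["myocardial infarction"]

vocab, matrix = build_features(record_terms, keywords)
splits = list(k_fold_indices(len(matrix), k=5))
print(vocab)
```

Excluding the keywords matters because a feature that defines the noisy label would otherwise let the model trivially reproduce the labeling rule.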

**Figure 2:**
Construction of the list of keywords used to assign noisy labels. First, a list of synonymous terms for concepts representing the descriptive phrase for the phenotype is generated. The list is sorted by frequency of mentions and the terms covering 90% of the mentions are inspected to remove terms that are ambiguous or not specific to the phenotype of interest.
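The frequency cutoff described in this caption can be sketched as follows. The mention counts are invented for illustration, and the final manual inspection is represented only by a hypothetical set of terms a reviewer chose to reject.

```python
def terms_covering(mention_counts, coverage=0.90):
    """Return the most frequent terms that together reach the coverage share."""
    total = sum(mention_counts.values())
    # Sort by descending count, breaking ties alphabetically.
    ranked = sorted(mention_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    selected, running = [], 0
    for term, count in ranked:
        if running >= coverage * total:
            break
        selected.append(term)
        running += count
    return selected

# Hypothetical synonym mention counts for one phenotype.
mention_counts = {
    "myocardial infarction": 500,
    "heart attack": 300,
    "mi": 150,
    "cardiac infarction": 40,
    "infarction of heart": 10,
}

candidates = terms_covering(mention_counts)
# Manual review step: a reviewer drops ambiguous terms (hypothetical choice).
rejected_by_reviewer = {"mi"}
keyword_list = [t for t in candidates if t not in rejected_by_reviewer]
print(keyword_list)  # ['myocardial infarction', 'heart attack']
```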

**Figure 3:**
Engineering features from structured and unstructured data elements in a patient record.

Cited by

Machine Learning in Rheumatic Diseases.
Jiang M, Li Y, Jiang C, Zhao L, Zhang X, Lipsky PE. Clin Rev Allergy Immunol. 2021 Feb;60(1):96-110. doi: 10.1007/s12016-020-08805-6. PMID: 32681407. Review.
Predicting Future Cardiovascular Events in Patients With Peripheral Artery Disease Using Electronic Health Record Data.
Ross EG, Jung K, Dudley JT, Li L, Leeper NJ, Shah NH. Circ Cardiovasc Qual Outcomes. 2019 Mar;12(3):e004741. doi: 10.1161/CIRCOUTCOMES.118.004741. PMID: 30857412. Free PMC article.
Feature extraction for phenotyping from semantic and knowledge resources.
Ning W, Chan S, Beam A, Yu M, Geva A, Liao K, Mullen M, Mandl KD, Kohane I, Cai T, Yu S. J Biomed Inform. 2019 Mar;91:103122. doi: 10.1016/j.jbi.2019.103122. Epub 2019 Feb 7. PMID: 30738949. Free PMC article.
A Roadmap for Foundational Research on Artificial Intelligence in Medical Imaging: From the 2018 NIH/RSNA/ACR/The Academy Workshop.
Langlotz CP, Allen B, Erickson BJ, Kalpathy-Cramer J, Bigelow K, Cook TS, Flanders AE, Lungren MP, Mendelson DS, Rudie JD, Wang G, Kandarpa K. Radiology. 2019 Jun;291(3):781-791. doi: 10.1148/radiol.2019190613. Epub 2019 Apr 16. PMID: 30990384. Free PMC article.
Representing and utilizing clinical textual data for real world studies: An OHDSI approach.
Keloth VK, Banda JM, Gurley M, Heider PM, Kennedy G, Liu H, Liu F, Miller T, Natarajan K, V Patterson O, Peng Y, Raja K, Reeves RM, Rouhizadeh M, Shi J, Wang X, Wang Y, Wei WQ, Williams AE, Zhang R, Belenkaya R, Reich C, Blacketer C, Ryan P, Hripcsak G, Elhadad N, Xu H. J Biomed Inform. 2023 Jun;142:104343. doi: 10.1016/j.jbi.2023.104343. Epub 2023 Mar 17. PMID: 36935011. Free PMC article.
