Integration of genetic and clinical information to improve imputation of data missing from electronic health records
- PMID: 31329892
- PMCID: PMC6748821
- DOI: 10.1093/jamia/ocz041
Abstract
Objective: Clinical data on patients' measurements and treatment histories stored in electronic health record (EHR) systems are starting to be mined for better treatment options and disease associations. A primary challenge in using EHR data is the considerable amount of missing data, and failure to address it can introduce significant bias into EHR-based research. Current imputation methods rely on correlations among the structured phenotype variables in the EHR. However, genetic studies have shown that many EHR-based phenotypes have a heritable component, suggesting that measured genetic variants might be useful for imputing missing data. In this article, we developed a computational model that incorporates patients' genetic information to perform EHR data imputation.
Materials and methods: We used the associations of individual single nucleotide polymorphisms with phenotype variables in the EHR as input to construct a genetic risk score that quantifies the genetic contribution to the phenotype. Multiple approaches to constructing the genetic risk score were evaluated for optimal performance. The genetic score, along with phenotype correlation, was then used as a predictor to impute the missing values.
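The idea described in the methods can be illustrated with a minimal sketch. It assumes a weighted-sum genetic risk score (each risk-allele count multiplied by a per-variant effect size) and a plain linear regression as the imputation model; the abstract does not specify either choice, and the effect sizes, genotypes, cholesterol values, and function names below are hypothetical toy values, not data or code from the study.

```python
def genetic_risk_score(allele_counts, effect_sizes):
    """Weighted sum of risk-allele counts (0, 1, or 2) over variants."""
    return sum(g * b for g, b in zip(allele_counts, effect_sizes))


def fit_ols(X, y):
    """Ordinary least squares with an intercept, via the normal equations."""
    rows = [[1.0] + list(x) for x in X]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y]
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def predict(coef, x):
    """Apply fitted coefficients (intercept first) to one predictor row."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], x))


# Hypothetical per-variant effect sizes (for example, from an association study)
effect_sizes = [0.2, 0.5, 0.1]

# Toy patients: genotypes, a correlated phenotype, and LDL (None = missing)
genotypes = [[0, 1, 2], [1, 0, 0], [2, 2, 1], [1, 1, 1], [0, 2, 0], [2, 0, 2]]
total_chol = [180.0, 200.0, 240.0, 210.0, 190.0, 220.0]
ldl = [114.0, 114.0, 160.0, 131.0, 125.0, None]

# Predictors: genetic risk score plus the correlated phenotype
predictors = [
    [genetic_risk_score(g, effect_sizes), c]
    for g, c in zip(genotypes, total_chol)
]

# Fit on patients with observed LDL, then impute the missing one
observed = [i for i, v in enumerate(ldl) if v is not None]
coef = fit_ols([predictors[i] for i in observed], [ldl[i] for i in observed])
imputed_ldl = predict(coef, predictors[5])
print(round(imputed_ldl, 2))
```

The toy LDL values were generated from an exact linear rule, so the fit recovers the missing value precisely; real EHR data would be noisy, and a binary phenotype would call for a classification model instead.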
Results: To demonstrate the method's performance, we applied our model to impute missing cardiovascular-related measurements, including low-density lipoprotein, heart failure, and aortic aneurysm disease, in the Electronic Medical Records and Genomics (eMERGE) data. The integration method improved the area under the curve of imputation for binary phenotypes and decreased the root-mean-square error for continuous phenotypes.
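For readers unfamiliar with the two metrics named in the results, the following sketch shows one common way to compute them: root-mean-square error for continuous phenotypes and a rank-based area under the ROC curve for binary phenotypes. The function names and the toy values are illustrative assumptions and are not taken from the study.

```python
import math


def rmse(observed, imputed):
    """Root-mean-square error between held-out true values and imputed values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, imputed)]
    return math.sqrt(sum(squared) / len(squared))


def auc(labels, scores):
    """Area under the ROC curve as the fraction of (case, control) pairs
    in which the case receives the higher score; ties count as half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy continuous phenotype: masked true values versus imputed values
print(rmse([100.0, 130.0, 160.0], [110.0, 130.0, 150.0]))

# Toy binary phenotype: true labels versus imputed probabilities
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # prints 0.75
```

A lower RMSE or a higher AUC on values that were deliberately masked and then imputed indicates better imputation.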
Conclusion: Compared with standard imputation approaches, incorporating genetic information offers a novel approach that can utilize more of the EHR data for better performance in missing data imputation.
Keywords: electronic health record; genetic risk score; imputation; missing data; single nucleotide polymorphisms.
© The Author(s) 2019. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For permissions, please email: journals.permissions@oup.com.
Similar articles
- IDENTIFYING GENETIC ASSOCIATIONS WITH VARIABILITY IN METABOLIC HEALTH AND BLOOD COUNT LABORATORY VALUES: DIVING INTO THE QUANTITATIVE TRAITS BY LEVERAGING LONGITUDINAL DATA FROM AN EHR. Pac Symp Biocomput. 2017;22:533-544. doi: 10.1142/9789813207813_0049. PMID: 27897004
- Imputation and Missing Indicators for Handling Missing Longitudinal Data: Data Simulation Analysis Based on Electronic Health Record Data. JMIR Med Inform. 2025 Mar 13;13:e64354. doi: 10.2196/64354. PMID: 40080075 Free PMC article.
- INTEGRATING CLINICAL LABORATORY MEASURES AND ICD-9 CODE DIAGNOSES IN PHENOME-WIDE ASSOCIATION STUDIES. Pac Symp Biocomput. 2016;21:168-79. PMID: 26776183 Free PMC article.
- Statistical Methods for Phenotype Estimation and Analysis Using Electronic Health Records [Internet]. Washington (DC): Patient-Centered Outcomes Research Institute (PCORI); 2021 Mar. PMID: 39133799 Free Books & Documents. Review.
- Implicit bias of encoded variables: frameworks for addressing structured bias in EHR-GWAS data. Hum Mol Genet. 2020 Sep 30;29(R1):R33-R41. doi: 10.1093/hmg/ddaa192. PMID: 32879975 Free PMC article. Review.
Cited by
- Increasing the Density of Laboratory Measures for Machine Learning Applications. J Clin Med. 2020 Dec 30;10(1):103. doi: 10.3390/jcm10010103. PMID: 33396741 Free PMC article.
- Using set visualisation to find and explain patterns of missing values: a case study with NHS hospital episode statistics data. BMJ Open. 2022 Nov 21;12(11):e064887. doi: 10.1136/bmjopen-2022-064887. PMID: 36410820 Free PMC article.
- The relationship of endothelial function and arterial stiffness with subclinical target organ damage in essential hypertension. J Clin Hypertens (Greenwich). 2022 Apr;24(4):418-429. doi: 10.1111/jch.14447. Epub 2022 Mar 3. PMID: 35238151 Free PMC article.
- A narrative review on the validity of electronic health record-based research in epidemiology. BMC Med Res Methodol. 2021 Oct 27;21(1):234. doi: 10.1186/s12874-021-01416-5. PMID: 34706667 Free PMC article. Review.
- Machine learning approaches for electronic health records phenotyping: a methodical review. J Am Med Inform Assoc. 2023 Jan 18;30(2):367-381. doi: 10.1093/jamia/ocac216. PMID: 36413056 Free PMC article.