J Am Med Inform Assoc. 2019 Oct 1;26(10):1056-1063.
doi: 10.1093/jamia/ocz041.

Integration of genetic and clinical information to improve imputation of data missing from electronic health records

Ruowang Li et al. J Am Med Inform Assoc.

Abstract

Objective: Clinical data of patients' measurements and treatment history stored in electronic health record (EHR) systems are starting to be mined for better treatment options and disease associations. A primary challenge associated with utilizing EHR data is the considerable amount of missing data. Failure to address this issue can introduce significant bias in EHR-based research. Currently, imputation methods rely on correlations among the structured phenotype variables in the EHR. However, genetic studies have shown that many EHR-based phenotypes have a heritable component, suggesting that measured genetic variants might be useful for imputing missing data. In this article, we developed a computational model that incorporates patients' genetic information to perform EHR data imputation.

Materials and methods: We used the association of each single nucleotide polymorphism (SNP) with the phenotype variables in the EHR as input to construct a genetic risk score (GRS) that quantifies the genetic contribution to the phenotype. Multiple approaches to constructing the GRS were evaluated for optimal performance. The GRS, along with phenotype correlation, is then used as a predictor to impute the missing values.
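As a rough illustration of this idea, the sketch below builds a genetic risk score as a weighted sum of allele dosages and uses it to fill in a missing continuous measurement. It is not the authors' implementation: the function names (`slope_intercept`, `genetic_risk_score`, `impute_with_grs`), the use of regression slopes as SNP weights, the selection of SNPs by absolute effect size, and the single-predictor linear model are all assumptions made for the example. The abstract also uses phenotype correlation as a predictor, which is omitted here for brevity.

```python
from statistics import mean


def slope_intercept(x, y):
    """Ordinary least-squares fit of y = a + b * x; returns (b, a)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:  # constant predictor: fall back to the mean of y
        return 0.0, my
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return b, my - b * mx


def genetic_risk_score(dosages, weights):
    """Weighted sum of allele dosages (0, 1, or 2) for one patient."""
    return sum(w * g for w, g in zip(weights, dosages))


def impute_with_grs(genotypes, phenotype, top_k):
    """Fill None entries of `phenotype` using a GRS fitted on complete cases.

    genotypes: one row per patient, one column per SNP (hypothetical layout).
    phenotype: one value per patient, None where the measurement is missing.
    """
    observed = [i for i, y in enumerate(phenotype) if y is not None]
    y_obs = [phenotype[i] for i in observed]
    n_snps = len(genotypes[0])
    # Per-SNP association, assumed here to be the slope of phenotype on dosage.
    effects = [
        slope_intercept([genotypes[i][j] for i in observed], y_obs)[0]
        for j in range(n_snps)
    ]
    # Keep the top_k SNPs with the largest absolute effect; zero out the rest.
    keep = set(sorted(range(n_snps), key=lambda j: -abs(effects[j]))[:top_k])
    weights = [effects[j] if j in keep else 0.0 for j in range(n_snps)]
    scores = [genetic_risk_score(row, weights) for row in genotypes]
    # Regress the observed phenotype on the GRS, then predict the missing values.
    b, a = slope_intercept([scores[i] for i in observed], y_obs)
    return [y if y is not None else a + b * s for y, s in zip(phenotype, scores)]
```

Calling `impute_with_grs(genotypes, phenotype, top_k=500)` would mirror the largest SNP set mentioned in the figure captions. In practice the weights should be estimated only on data that is not used for evaluation.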

Results: To demonstrate the method's performance, we applied our model to impute missing cardiovascular-related measurements, including low-density lipoprotein (LDL), heart failure (HF), and aortic aneurysm disease, in the Electronic Medical Records and Genomics data. The integration method improved the imputation area under the curve (AUC) for binary phenotypes and decreased the root-mean-square error for continuous phenotypes.
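The two evaluation metrics named here can be computed as in the following minimal sketch. The helper names `auc` and `rmse` are hypothetical, and the pairwise AUC is written for clarity rather than speed. It assumes binary labels coded as 0 and 1.

```python
from math import sqrt


def auc(labels, scores):
    """Area under the ROC curve: the probability that a random case (label 1)
    is scored higher than a random control (label 0); ties count as half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case and one control")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def rmse(truth, imputed):
    """Root-mean-square error between true and imputed continuous values."""
    return sqrt(sum((t - i) ** 2 for t, i in zip(truth, imputed)) / len(truth))
```

A higher AUC indicates better imputation of a binary phenotype such as HF, while a lower RMSE indicates better imputation of a continuous phenotype such as LDL.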

Conclusion: Compared with standard imputation approaches, incorporating genetic information offers a novel approach that can utilize more of the EHR data for better performance in missing data imputation.

Keywords: electronic health record; genetic risk score; imputation; missing data; single nucleotide polymorphisms.

Figures

Figure 1.
Overview of the imputation model. Complete data were used to assess each SNP's association with the phenotype (Steps 1–5). A genetic risk score (GRS) is then used to summarize multiple SNPs based on their associations (Step 6). The GRS and other clinical variables are then used to impute the missing values (Step 7). The variability of the imputation is assessed using 100 different cross-validations (Step 8).
Figure 2.
Impact of incorporating genetic information on imputation accuracy of AAA and HF. The 3 vertical panels indicate different percentages of missing data (10%, 30%, and 50%). Horizontal panels show the 6 different disease and consent group combinations. The red band represents accuracies using SNPs selected by AUC from 100 repetitions. The green band represents P value selection. From left to right, the x-axis represents GRSs calculated from an increasing number of SNPs, eg, SNP(1), SNP(1, 2), and SNP(1, 2, 3… 500). The y-axis shows the imputation AUC on the testing data.
Figure 3.
Comparison of imputation models on LDL. Vertical panels show different percentages of missing data. GRS P value and GRS R2 consist of the top 500 SNPs selected by P value or r-squared, respectively. Statistical significance was assessed using a t-test.
Figure 4.
Improved power on known HF-associated SNPs. Five HF-associated SNPs showed significance (P < 10^-8, or equivalently -log10(P) > 8) in at least 1 setting. The box plots show the P values associated with each of the 5 SNPs without imputation (missing), imputing using AUC-selected SNPs (impute_auc), and imputing using P value-selected SNPs (impute_P) over 100 repetitions.
