Pac Symp Biocomput. 2017:22:207-218.

doi: 10.1142/9789813207813_0021.

MISSING DATA IMPUTATION IN THE ELECTRONIC HEALTH RECORD USING DEEPLY LEARNED AUTOENCODERS

Brett K Beaulieu-Jones¹, Jason H Moore

Affiliation

¹ Genomics and Computational Biology Graduate Group, Computational Genetics Lab, Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, 3700 Hamilton Walk, Philadelphia PA, 19104, USA, brettbe@med.upenn.edu.

PMID: 27896976
PMCID: PMC5144587
DOI: 10.1142/9789813207813_0021

Brett K Beaulieu-Jones et al. Pac Symp Biocomput. 2017.

Abstract

Electronic health records (EHRs) have become a vital source of patient outcome data, but the widespread prevalence of missing data presents a major challenge. Different causes of missing data in the EHR may introduce unintentional bias. Here, we compare the effectiveness of popular multiple imputation strategies with a deeply learned autoencoder using the Pooled Resource Open-Access ALS Clinical Trials Database (PRO-ACT). To evaluate performance, we examined imputation accuracy for known values simulated to be either missing completely at random or missing not at random. We also compared ALS disease progression prediction across different imputation models. Autoencoders showed strong performance for imputation accuracy and contributed to the strongest disease progression predictor. Finally, we show that despite clinical heterogeneity, ALS disease progression appears homogeneous, with time from onset being the most important predictor.

Figures

**Figure 1**
Schematic structure of the autoencoder used for evaluations, with two hidden layers and 20% dropout between each layer.
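
The layer arrangement in this caption can be illustrated with a minimal forward-pass sketch. It is not the authors' implementation. The ReLU activation, the weight initialization, the toy layer sizes, and all function names are assumptions made for illustration, and no training loop is shown, so the weights are random.

```python
import math
import random

def dense(x, weights, biases):
    """Affine layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def relu(x):
    return [max(0.0, v) for v in x]

def dropout(x, rate, rng, training):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(x)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in x]

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases (initialization scheme is an assumption)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(x, layers, rate, rng, training=True):
    """input -> dropout -> hidden 1 -> dropout -> hidden 2 -> dropout -> output."""
    h = dropout(x, rate, rng, training)
    for i, (weights, biases) in enumerate(layers):
        h = dense(h, weights, biases)
        if i < len(layers) - 1:  # activation and dropout on hidden layers only
            h = dropout(relu(h), rate, rng, training)
    return h

rng = random.Random(0)
n_features, n_hidden = 6, 8  # toy sizes, hypothetical
layers = [init_layer(n_features, n_hidden, rng),
          init_layer(n_hidden, n_hidden, rng),
          init_layer(n_hidden, n_features, rng)]
x = [0.2, 0.0, 0.7, 0.0, 0.5, 0.1]  # one toy patient row, missing entries set to 0
reconstruction = forward(x, layers, rate=0.2, rng=rng, training=False)
```

In an imputation setting, the reconstruction at the positions that were missing in the input would serve as the imputed values once the network has been trained.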

**Figure 2**
Evaluation outline. **(a)** Imputation evaluation. Known values in the PRO-ACT data of 10,723 subjects are masked with spiked-in missing data. Imputation strategies are run in parallel, and the RMSE is calculated between the masked input values and each strategy's imputations. **(b)** Progression prediction. Missing values in the PRO-ACT patient data are imputed using each strategy, and ten-fold cross-validation of a random forest regressor is performed on the imputed data.
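
The evaluation in panel (a) can be sketched on toy data as follows. This is a minimal illustration, not the paper's pipeline: the data are synthetic, only column-mean imputation is shown, the masking is completely at random, and the function names are hypothetical.

```python
import math
import random

def spike_in_missing(data, fraction, rng):
    """Copy `data`, set a random `fraction` of cells to None, return the copy and the masked cells."""
    cells = [(i, j) for i, row in enumerate(data) for j in range(len(row))]
    masked = rng.sample(cells, int(round(fraction * len(cells))))
    corrupted = [list(row) for row in data]
    for i, j in masked:
        corrupted[i][j] = None
    return corrupted, masked

def impute_column_mean(data):
    """Replace each None with the mean of the observed values in its column."""
    means = []
    for j in range(len(data[0])):
        observed = [row[j] for row in data if row[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in data]

def masked_rmse(truth, imputed, masked):
    """RMSE computed only over the cells that were masked."""
    squared = [(truth[i][j] - imputed[i][j]) ** 2 for i, j in masked]
    return math.sqrt(sum(squared) / len(squared))

rng = random.Random(0)
truth = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(50)]  # synthetic data
corrupted, masked = spike_in_missing(truth, 0.2, rng)
imputed = impute_column_mean(corrupted)
score = masked_rmse(truth, imputed, masked)
```

Scoring only the masked cells matters: including cells that were never hidden would lower the error for every method without saying anything about imputation quality.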

**Figure 3**
Histogram distribution and rug plot showing the number of patients each feature is present in. **(a)** The number of features each patient has. Ticks at the bottom indicate one patient with the count of features, bins indicate the number of patients in a range. **(b)** The number of patients having a recorded value for each feature. Ticks at the bottom indicate the number of patients a feature is present in, bins indicate the number of features in a range.

**Figure 4**
Effect of the amount of spiked-in missing data on imputation. Error bars indicate 5-fold cross validation score ranges.

**Figure 5**
Effect of non-random spiked-in missing data on imputation, measured in root mean squared error (RMSE). Autoencoder: autoencoder with dropout (2 hidden layers, 500 nodes each); SVD: SVDImpute with rank 40; KNN: KNNimpute with 7 neighbors; Mean: column mean averaging; Median: column median averaging; SI: SoftImpute.
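
To illustrate why non-random missingness is harder than random missingness, the sketch below contrasts the two on a synthetic column. The non-random rule used here (only values above a threshold can be masked) is an assumption chosen for illustration; the mechanism actually used in the paper is not described in this excerpt.

```python
import random

def mcar_mask(column, rate, rng):
    """Missing completely at random: every entry has the same masking probability."""
    return [rng.random() < rate for _ in column]

def mnar_mask(column, rate, rng, threshold):
    """Missing not at random: only entries above `threshold` can be masked."""
    return [v > threshold and rng.random() < rate for v in column]

def observed_mean(column, mask):
    """Mean of the entries that were not masked."""
    kept = [v for v, m in zip(column, mask) if not m]
    return sum(kept) / len(kept)

rng = random.Random(0)
column = [rng.gauss(0.0, 1.0) for _ in range(5000)]  # synthetic feature
true_mean = sum(column) / len(column)

mcar = mcar_mask(column, 0.3, rng)
mnar = mnar_mask(column, 0.6, rng, threshold=0.0)  # roughly 30% masked overall

mcar_bias = observed_mean(column, mcar) - true_mean
mnar_bias = observed_mean(column, mnar) - true_mean
```

Under the random mask the observed mean stays close to the true mean, while under the value-dependent mask it is pulled downward, so an imputer that fills in the observed column mean inherits that bias.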

**Figure 6**
ALS Functional Rating Scale prediction accuracy shown for an autoencoder, k-nearest neighbors, mean averaging, median averaging, the raw input including missing values, soft impute and singular value decomposition. The box indicates inner quartiles with the line representing the median; the whiskers indicate outer quartiles excluding outliers.

**Figure 7**
Prediction feature importance. **(a)** Importance levels of the top 10 features to the random forest regressor with autoencoder imputed data. **(b)** Histogram distribution of patient ALSFRS slope levels.
