Pac Symp Biocomput. 2017:22:207-218.
doi: 10.1142/9789813207813_0021.

MISSING DATA IMPUTATION IN THE ELECTRONIC HEALTH RECORD USING DEEPLY LEARNED AUTOENCODERS

Brett K Beaulieu-Jones et al. Pac Symp Biocomput. 2017.

Abstract

Electronic health records (EHRs) have become a vital source of patient outcome data, but the widespread prevalence of missing data presents a major challenge. Different causes of missing data in the EHR may introduce unintentional bias. Here, we compare the effectiveness of popular multiple imputation strategies with a deeply learned autoencoder using the Pooled Resource Open-Access ALS Clinical Trials Database (PRO-ACT). To evaluate performance, we examined imputation accuracy for known values simulated to be either missing completely at random or missing not at random. We also compared ALS disease progression prediction across different imputation models. Autoencoders showed strong performance for imputation accuracy and contributed to the strongest disease progression predictor. Finally, we show that despite clinical heterogeneity, ALS disease progression appears homogeneous, with time from onset being the most important predictor.

Figures

Figure 1
Schematic structure of the autoencoder used for evaluations, with two hidden layers and 20% dropout between each layer.
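The architecture in this caption can be sketched as a forward pass: dropout is applied to the input of each fully connected layer, and the data pass through two hidden layers before being reconstructed at the original width. This is a minimal illustration, not the authors' implementation. The layer sizes, sigmoid activations, weight initialization, and all function names are assumptions made here, and the weights are untrained.

```python
import math
import random

def dropout(values, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(values, weights, biases):
    """Fully connected layer with a sigmoid activation (activation is an assumption)."""
    return [sigmoid(sum(w * v for w, v in zip(row, values)) + b)
            for row, b in zip(weights, biases)]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (hypothetical initialization)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(x, layers, rate, rng, training=True):
    """Apply dropout before every dense layer while training, none at inference."""
    h = x
    for weights, biases in layers:
        if training:
            h = dropout(h, rate, rng)
        h = dense(h, weights, biases)
    return h

rng = random.Random(0)
n_features, n_hidden = 8, 16  # toy sizes chosen for illustration only
layers = [
    init_layer(n_features, n_hidden, rng),  # input -> hidden layer 1
    init_layer(n_hidden, n_hidden, rng),    # hidden layer 1 -> hidden layer 2
    init_layer(n_hidden, n_features, rng),  # hidden layer 2 -> reconstruction
]
x = [rng.random() for _ in range(n_features)]  # features assumed scaled to [0, 1]
reconstruction = forward(x, layers, rate=0.2, rng=rng)
```

In an imputation setting, the reconstruction at positions where the input was missing would serve as the imputed value once the network has been trained; training is omitted from this sketch.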
Figure 2
Evaluation outline. (a) Imputation evaluation. Known values in the PRO-ACT data of 10,723 subjects are masked to create spiked-in missing data. Imputation strategies are run in parallel, and the RMSE is calculated between the masked input values and each strategy's imputations. (b) Progression prediction. PRO-ACT patient data are imputed using each strategy, and ten-fold cross-validation of a random forest regressor is performed on the imputed data.
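The imputation evaluation in panel (a) can be sketched as follows: hide a fraction of the known cells, impute, then score only the hidden cells with RMSE. This is a sketch under stated assumptions rather than the study's pipeline. The toy data, the 20% fraction, the function names, and the use of column-mean imputation as the example strategy are all choices made here for illustration.

```python
import math
import random

def spike_in_missing(rows, fraction, rng):
    """Hide `fraction` of the observed cells; return the masked copy and hidden positions."""
    observed = [(i, j) for i, row in enumerate(rows)
                for j, v in enumerate(row) if v is not None]
    hidden = rng.sample(observed, int(round(fraction * len(observed))))
    masked = [list(row) for row in rows]
    for i, j in hidden:
        masked[i][j] = None
    return masked, hidden

def impute_column_mean(rows):
    """Replace each missing cell with the mean of the observed values in its column."""
    means = []
    for j in range(len(rows[0])):
        present = [row[j] for row in rows if row[j] is not None]
        means.append(sum(present) / len(present) if present else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]

def rmse_on_hidden(truth, imputed, hidden):
    """Root mean squared error computed over the hidden cells only."""
    squared = [(truth[i][j] - imputed[i][j]) ** 2 for i, j in hidden]
    return math.sqrt(sum(squared) / len(squared))

rng = random.Random(0)
truth = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(50)]  # toy data
masked, hidden = spike_in_missing(truth, 0.2, rng)
imputed = impute_column_mean(masked)
score = rmse_on_hidden(truth, imputed, hidden)
```

Scoring only the hidden cells matters: cells that were never masked are copied through unchanged by any strategy, so including them would shrink every method's error toward zero and blur the comparison.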
Figure 3
Histogram distribution and rug plot showing the number of patients each feature is present in. (a) The number of features each patient has. Ticks at the bottom indicate one patient with the count of features, bins indicate the number of patients in a range. (b) The number of patients having a recorded value for each feature. Ticks at the bottom indicate the number of patients a feature is present in, bins indicate the number of features in a range.
Figure 4
Effect of the amount of spiked-in missing data on imputation. Error bars indicate 5-fold cross validation score ranges.
Figure 5
Effect of non-random spiked-in missing data on imputation, measured as root mean squared error (RMSE). Autoencoder: autoencoder with dropout (2 layers, 500 nodes each); SVD: SVDImpute with rank 40; KNN: KNNimpute with 7 neighbors; Mean: column mean averaging; Median: column median averaging; SI: SoftImpute.
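To make the distinction between random and non-random spiked-in missingness concrete, the sketch below masks one toy column in two ways and compares the mean of what remains. The non-random rule used here (hide the largest values) is a hypothetical example chosen for illustration; the source does not state the rule used in the study, and all names and parameters are assumptions.

```python
import random

def mask_mcar(column, fraction, rng):
    """Missing completely at random: every value is equally likely to be hidden."""
    n_hidden = int(round(fraction * len(column)))
    hidden = set(rng.sample(range(len(column)), n_hidden))
    return [None if i in hidden else v for i, v in enumerate(column)]

def mask_mnar_top(column, fraction):
    """Missing not at random (illustrative rule): hide the largest values."""
    n_hidden = int(round(fraction * len(column)))
    order = sorted(range(len(column)), key=lambda i: column[i], reverse=True)
    hidden = set(order[:n_hidden])
    return [None if i in hidden else v for i, v in enumerate(column)]

def observed_mean(column):
    """Mean of the values that are still present."""
    present = [v for v in column if v is not None]
    return sum(present) / len(present)

rng = random.Random(0)
values = [rng.gauss(50.0, 10.0) for _ in range(1000)]  # toy measurements
true_mean = sum(values) / len(values)
mcar_mean = observed_mean(mask_mcar(values, 0.3, rng))
mnar_mean = observed_mean(mask_mnar_top(values, 0.3))
```

Under the random mask the observed mean stays close to the true mean, while under this non-random rule it is pulled downward. That is why a column-mean or column-median fill can look acceptable on randomly masked data yet be biased when the missingness depends on the value itself.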
Figure 6
ALS Functional Rating Scale prediction accuracy shown for an autoencoder, k-nearest neighbors, mean averaging, median averaging, the raw input including missing values, SoftImpute, and singular value decomposition. The box indicates the inner quartiles, with the line representing the median; the whiskers indicate the outer quartiles, excluding outliers.
Figure 7
Prediction feature importance. (a) Importance levels of the top 10 features to the random forest regressor with autoencoder imputed data. (b) Histogram distribution of patient ALSFRS slope levels.
