J Am Med Inform Assoc. 2020 Dec 9;27(12):1921-1934.
doi: 10.1093/jamia/ocaa139.

Democratizing EHR analyses with FIDDLE: a flexible data-driven preprocessing pipeline for structured clinical data

Shengpu Tang et al. J Am Med Inform Assoc.

Abstract

Objective: In applying machine learning (ML) to electronic health record (EHR) data, many decisions must be made before any ML is applied; such preprocessing is labor-intensive. As the role of ML in health care grows, there is an increasing need for systematic and reproducible preprocessing techniques for EHR data. Thus, we developed FIDDLE (Flexible Data-Driven Pipeline), an open-source framework that streamlines the preprocessing of data extracted from the EHR.

Materials and methods: Largely data-driven, FIDDLE systematically transforms structured EHR data into feature vectors, limiting the number of decisions a user must make while incorporating good practices from the literature. To demonstrate its utility and flexibility, we conducted a proof-of-concept experiment in which we applied FIDDLE to 2 publicly available EHR data sets collected from intensive care units: MIMIC-III and the eICU Collaborative Research Database. We trained different ML models to predict 3 clinically important outcomes: in-hospital mortality, acute respiratory failure, and shock. We evaluated the models using the area under the receiver operating characteristics curve (AUROC) and compared them with several baselines.
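The AUROC used for evaluation can be illustrated with a minimal sketch. It uses the pairwise (Mann-Whitney) formulation: the fraction of positive-negative pairs in which the positive example receives the higher score, with ties counted as one half. The function name `auroc` and the toy labels and scores are illustrative assumptions, not taken from the paper or from the FIDDLE code.

```python
def auroc(y_true, y_score):
    """Pairwise AUROC: P(score of a positive > score of a negative), ties count 0.5.

    Illustrative helper only; quadratic in the number of examples.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 2 negatives, 2 positives.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
result = auroc(labels, scores)  # 3 of the 4 positive-negative pairs are ordered correctly
```

For real cohorts a sorted, rank-based implementation from an established library would normally be preferable to this quadratic loop.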

Results: Across tasks on MIMIC-III and eICU, FIDDLE extracted between 2,528 and 7,403 features. On all tasks, FIDDLE-based models achieved good discriminative performance, with AUROCs of 0.757-0.886, comparable to the performance of MIMIC-Extract, a preprocessing pipeline designed specifically for MIMIC-III. Furthermore, our results showed that FIDDLE is generalizable across different prediction times, ML algorithms, and data sets, while being relatively robust to different settings of user-defined arguments.

Conclusions: FIDDLE, an open-source preprocessing pipeline, facilitates applying ML to structured EHR data. By accelerating and standardizing labor-intensive preprocessing, FIDDLE can help stimulate progress in building clinically useful ML tools for EHR data.

Keywords: electronic health records; machine learning; preprocessing pipeline.

Figures

Figure 1.
Overview of FIDDLE. Given formatted input data and user-defined arguments, FIDDLE processes data in 3 stages: (1) pre-filter, (2) transform, and (3) post-filter. So long as the units are consistent, timestamps in the t column may be recorded at any level of granularity (eg, seconds, minutes, hours, days, visits, etc.). In this sample input file, we consider time in hours. A row with [1, 0.2, Heart Rate, 72] corresponds to a patient with ID = 1 with a heart rate = 72 bpm recorded at t = 0.2 h. In (1) pre-filter, FIDDLE eliminates rare variables. In (2) transform, FIDDLE transforms data into tensors containing time-invariant and time-dependent features. In (3) post-filter, FIDDLE removes redundant features and features that are likely uninformative. The output consists of binary vectors s_i and x_i, describing the features for each ID. bpm: beats per minute; FIDDLE: Flexible Data-Driven Pipeline; ID: unique identifier; KCl: potassium chloride; WBC: white blood cell.
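The input format and the first two stages can be sketched in a few lines. This is a toy illustration, not the FIDDLE implementation: the function names `pre_filter` and `bin_by_time`, the parameters `min_fraction` and `dt`, and the rule "keep a variable if it appears for at least a given fraction of IDs" are all assumptions made for the example. Only the row layout [ID, t, variable name, value] and the first row come from the caption.

```python
from collections import defaultdict

# Toy long-format rows: (ID, t in hours, variable_name, variable_value).
# The first row mirrors the caption's example; the rest are made up.
rows = [
    (1, 0.2, "Heart Rate", 72),
    (1, 1.4, "Heart Rate", 80),
    (2, 0.5, "Heart Rate", 65),
    (2, 0.7, "WBC", 9.1),
    (3, 2.1, "Heart Rate", 90),
]


def pre_filter(rows, min_fraction):
    """Keep only variables observed for at least min_fraction of all IDs (assumed rule)."""
    all_ids = {r[0] for r in rows}
    ids_per_variable = defaultdict(set)
    for id_, _, name, _ in rows:
        ids_per_variable[name].add(id_)
    keep = {
        name
        for name, ids in ids_per_variable.items()
        if len(ids) / len(all_ids) >= min_fraction
    }
    return [r for r in rows if r[2] in keep]


def bin_by_time(rows, dt):
    """Group values by (ID, variable, time-bin index) using bins of width dt."""
    binned = defaultdict(list)
    for id_, t, name, value in rows:
        binned[(id_, name, int(t // dt))].append(value)
    return dict(binned)


filtered = pre_filter(rows, min_fraction=0.5)  # "WBC" appears for 1 of 3 IDs and is dropped
binned = bin_by_time(filtered, dt=1.0)         # 1-hour bins
```

Because the timestamps only need consistent units, the same sketch works for minutes or days by changing `dt` accordingly.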
Figure 2.
Examples of FIDDLE input and output for time-invariant and time-dependent data. In this example, each ID represents a patient (an example). Timestamps are recorded in hours. Only the subset of input/output relevant for illustration is shown. The bins for numerical variables and the categories for categorical variables are automatically determined from the entire input data table (not shown). (A) Time-invariant input data and output features for Patient 1. Patient 1 is female with an age of 55. The feature “sex = female” is dropped in the post-filter step because it is perfectly correlated with “sex = male.” (B) Time-dependent input data and output features for Patient 2. At t = 1.5 h, Patient 2 had an insulin administration of 3 units via drug push. No imputation in 2–4 h is done, since the 3 variables related to insulin are not considered “frequent,” resulting in 0s in the output features for the corresponding time bins. FIDDLE: Flexible Data-Driven Pipeline; ID: unique identifier; IV: intravenous.
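The post-filter behavior described in (A) can be sketched as follows. This is a minimal illustration under stated assumptions, not FIDDLE's code: the function name `drop_redundant`, the feature name "age in 50-60 bin", and the toy columns are hypothetical, and "uninformative" is approximated here as "constant across all IDs". For binary columns, perfect correlation means one column is identical to, or the exact complement of, another; which member of such a pair survives depends only on column order in this sketch.

```python
def drop_redundant(features):
    """Filter binary feature columns (dict: name -> list of 0/1, one value per ID).

    Drops constant columns and any column that is identical to, or the exact
    complement of, a column that has already been kept.
    """
    kept = {}
    for name, column in features.items():
        if len(set(column)) <= 1:
            continue  # constant column: treated as uninformative (assumption)
        complement = [1 - v for v in column]
        if any(column == k or complement == k for k in kept.values()):
            continue  # perfectly correlated with a kept column
        kept[name] = column
    return kept


# Toy binary features for 4 hypothetical patients.
features = {
    "sex = male": [0, 1, 1, 0],
    "sex = female": [1, 0, 0, 1],      # complement of "sex = male"
    "age in 50-60 bin": [1, 0, 1, 0],  # hypothetical bin name
    "always zero": [0, 0, 0, 0],
}
kept = drop_redundant(features)
```

Here `kept` retains "sex = male" and the age-bin feature, matching the caption's example in which "sex = female" is removed.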
Figure 3.
Harutyunyan et al definitions of the study cohorts. For each data set (MIMIC-III and eICU), we defined 5 prediction tasks, each with a distinct study cohort: in-hospital mortality at 48 h, ARF at 4 h, ARF at 12 h, shock at 4 h, and shock at 12 h. ARF: acute respiratory failure; ICU: intensive care unit; PEEP: positive end-expiratory pressure.
Figure 4.
Dimensionality of feature vectors for each prediction task on MIMIC-III. After applying FIDDLE to the MIMIC-III study cohorts, an ICU visit is represented by time-invariant features and time-dependent features, both of which are high-dimensional. Though the number of time-invariant features is similar across tasks, the number of time-dependent features varies because more data (likely corresponding to more variables) are collected for a later prediction time. FIDDLE: Flexible Data-Driven Pipeline; ICU: intensive care unit.
Figure 5.
Model performance (with 95% CI) for prediction of ARF at t = 12 h on MIMIC-III, evaluated on the held-out test set (n = 2093). On this task, all 4 FIDDLE-based models exhibited similarly good discriminative and calibration performance. (A) ROC curves and AUROC scores. (B) PR curves and AUPR scores. (C) Calibration plots and Brier scores. ARF: acute respiratory failure; AUROC: area under the receiver operating characteristics curve; AUPR: area under the precision-recall curve; CI: confidence interval; CNN: convolutional neural networks; FIDDLE: Flexible Data-Driven Pipeline; LR: logistic regression; LSTM: long short-term memory networks; PR: precision-recall curve; RF: random forest; ROC: receiver operating characteristics curve.
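The Brier score reported alongside the calibration plots is the mean squared difference between predicted probabilities and binary outcomes; lower is better. A minimal sketch, with a hypothetical function name and toy values that are not taken from the paper:

```python
def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(y_true) != len(y_prob) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Toy example (hypothetical): one positive predicted at 0.9, one negative at 0.2.
score = brier_score([1, 0], [0.9, 0.2])  # (0.01 + 0.04) / 2
```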
