J Am Med Inform Assoc. 2020 Dec 9;27(12):1921-1934.

doi: 10.1093/jamia/ocaa139.

Democratizing EHR analyses with FIDDLE: a flexible data-driven preprocessing pipeline for structured clinical data

Shengpu Tang¹, Parmida Davarmanesh², Yanmeng Song³, Danai Koutra¹, Michael W Sjoding⁴ ⁵ ⁶ ⁷, Jenna Wiens¹ ⁵ ⁶

Affiliations

¹ Department of Electrical Engineering and Computer Science, Division of Computer Science and Engineering, University of Michigan, Ann Arbor, USA.
² Department of Mathematics, University of Michigan, Ann Arbor, USA.
³ Department of Statistics, University of Michigan, Ann Arbor, USA.
⁴ Department of Internal Medicine, University of Michigan, Ann Arbor, USA.
⁵ Institution for Healthcare Policy & Innovation, University of Michigan, Ann Arbor, USA.
⁶ Michigan Integrated Center for Health Analytics and Medical Prediction, University of Michigan, Ann Arbor, USA.
⁷ Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, USA.

PMID: 33040151
PMCID: PMC7727385
DOI: 10.1093/jamia/ocaa139

Shengpu Tang et al. J Am Med Inform Assoc. 2020.

Abstract

Objective: In applying machine learning (ML) to electronic health record (EHR) data, many decisions must be made before any ML is applied; such preprocessing is labor-intensive. As the role of ML in health care grows, there is an increasing need for systematic and reproducible preprocessing techniques for EHR data. Thus, we developed FIDDLE (Flexible Data-Driven Pipeline), an open-source framework that streamlines the preprocessing of data extracted from the EHR.

Materials and methods: Largely data-driven, FIDDLE systematically transforms structured EHR data into feature vectors, limiting the number of decisions a user must make while incorporating good practices from the literature. To demonstrate its utility and flexibility, we conducted a proof-of-concept experiment in which we applied FIDDLE to 2 publicly available EHR data sets collected from intensive care units: MIMIC-III and the eICU Collaborative Research Database. We trained different ML models to predict 3 clinically important outcomes: in-hospital mortality, acute respiratory failure, and shock. We evaluated the models using the area under the receiver operating characteristics curve (AUROC) and compared their performance with several baselines.

Results: Across tasks, FIDDLE extracted 2,528 to 7,403 features from MIMIC-III and eICU. On all tasks, FIDDLE-based models achieved good discriminative performance, with AUROCs of 0.757-0.886, comparable to the performance of MIMIC-Extract, a preprocessing pipeline designed specifically for MIMIC-III. Furthermore, our results showed that FIDDLE generalizes across different prediction times, ML algorithms, and data sets, while being relatively robust to different settings of user-defined arguments.

Conclusions: FIDDLE, an open-source preprocessing pipeline, facilitates applying ML to structured EHR data. By accelerating and standardizing labor-intensive preprocessing, FIDDLE can help stimulate progress in building clinically useful ML tools for EHR data.

Keywords: electronic health records; machine learning; preprocessing pipeline.

Figures

**Figure 1.**
Overview of FIDDLE. Given formatted input data and user-defined arguments, FIDDLE processes data in 3 stages: (1) pre-filter, (2) transform, and (3) post-filter. So long as the units are consistent, timestamps in the t column may be recorded at any level of granularity (eg, seconds, minutes, hours, days, visits). In this sample input file, we consider time in hours. A row with [1, 0.2, Heart Rate, 72] corresponds to a patient with ID = 1 with a heart rate = 72 bpm recorded at t = 0.2 h. In (1) pre-filter, FIDDLE eliminates rare variables. In (2) transform, FIDDLE transforms data into tensors containing time-invariant and time-dependent features. In (3) post-filter, FIDDLE removes redundant features and features that are likely uninformative. The output consists of binary vectors $s_{i}$ and $x_{i}$, describing the features for each ID. bpm: beats per minute; FIDDLE: Flexible Data-Driven Pipeline; ID: unique identifier; KCl: potassium chloride; WBC: white blood cell.
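
The input layout and the pre-filter stage described in this caption can be sketched as follows. This is a minimal illustration, not FIDDLE's actual code: the function name `pre_filter`, the `min_fraction` threshold, and every sample row except [1, 0.2, Heart Rate, 72] are assumptions made for the example, and "rare" is interpreted here as "recorded for too small a fraction of IDs."

```python
from collections import defaultdict

# Long-format input rows: (ID, t, variable_name, variable_value), t in hours.
# Only the first row comes from the caption; the others are made-up samples.
rows = [
    (1, 0.2, "Heart Rate", 72),
    (1, 1.5, "Heart Rate", 80),
    (2, 0.4, "Heart Rate", 65),
    (2, 0.9, "WBC", 7.1),
    (3, 0.1, "Heart Rate", 90),
]


def pre_filter(rows, min_fraction):
    """Drop variables recorded for fewer than min_fraction of all IDs.

    Hypothetical helper: the threshold rule is an assumption for this sketch.
    """
    all_ids = {row[0] for row in rows}
    ids_per_variable = defaultdict(set)
    for id_, _t, name, _value in rows:
        ids_per_variable[name].add(id_)
    keep = {
        name
        for name, ids in ids_per_variable.items()
        if len(ids) / len(all_ids) >= min_fraction
    }
    return [row for row in rows if row[2] in keep]


# "WBC" appears for 1 of 3 IDs, so it is treated as rare and removed.
filtered = pre_filter(rows, min_fraction=0.5)
```

With these sample rows, only the "Heart Rate" measurements remain after filtering, because that variable is recorded for every ID.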

**Figure 2.**
Examples of FIDDLE input and output for time-invariant and time-dependent data. In this example, each ID represents a patient (an example). Timestamps are recorded in hours. Only the subset of input/output relevant for illustration is shown. The bins for numerical variables and the categories for categorical variables are automatically determined from the entire input data table (not shown). (A) Time-invariant input data and output features for Patient 1. Patient 1 is female with an age of 55. The feature “sex = female” is dropped in the post-filter step because it is perfectly correlated with “sex = male.” (B) Time-dependent input data and output features for Patient 2. At t = 1.5 h, Patient 2 had an insulin administration of 3 units via drug push. No imputation is done for the 2–4 h interval, since the 3 variables related to insulin are not considered “frequent,” resulting in 0s in the output features for the corresponding time bins. FIDDLE: Flexible Data-Driven Pipeline; ID: unique identifier; IV: intravenous.
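
The one-hot encoding and the post-filter behavior in panel (A) can be sketched as below. This is an illustrative sketch under stated assumptions rather than FIDDLE's implementation: `one_hot` and `post_filter` are hypothetical helpers, the sample values are made up, and "perfectly correlated" is taken to mean that two binary columns are identical or exact complements. The sketch keeps the first column it sees, so it retains "sex = female" and drops "sex = male"; the caption's choice is the reverse, but the redundancy removed is the same.

```python
def one_hot(name, values):
    """Encode a categorical variable as binary columns, one per category."""
    categories = sorted(set(values))
    return {
        f"{name} = {category}": [int(v == category) for v in values]
        for category in categories
    }


def post_filter(features):
    """Drop constant columns and columns redundant with an already kept one.

    For non-constant binary columns, perfect (positive or negative)
    correlation means the columns are identical or exact complements.
    """
    kept = {}
    for name, column in features.items():
        if len(set(column)) < 2:
            continue  # constant column: carries no information
        complement = [1 - v for v in column]
        if any(column == k or complement == k for k in kept.values()):
            continue  # redundant with a column that was already kept
        kept[name] = column
    return kept


# Made-up sample: one value per ID.
sex = ["female", "male", "female", "male"]
features = one_hot("sex", sex)   # two columns that are exact complements
selected = post_filter(features)  # only one of the two survives
```

Because the two "sex" columns always sum to 1, either one determines the other, which is why a single column suffices after filtering.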

**Figure 3.**
Harutyunyan et al definitions of the study cohorts. For each data set (MIMIC-III and eICU), we defined 5 prediction tasks, each with a distinct study cohort: in-hospital mortality at 48 h, ARF at 4 h, ARF at 12 h, shock at 4 h, and shock at 12 h. ARF: acute respiratory failure; ICU: intensive care unit; PEEP: positive end-expiratory pressure.

**Figure 4.**
Dimensionality of feature vectors for each prediction task on MIMIC-III. After applying FIDDLE to the MIMIC-III study cohorts, an ICU visit is represented by time-invariant features and time-dependent features, both of which are high-dimensional. Though the number of time-invariant features is similar across tasks, the number of time-dependent features varies because more data (likely corresponding to more variables) are collected for a later prediction time. FIDDLE: Flexible Data-Driven Pipeline; ICU: intensive care unit.

**Figure 5.**
Model performance (with 95% CI) for prediction of ARF at t = 12 h on MIMIC-III, evaluated on the held-out test set (n = 2093). On this task, all 4 FIDDLE-based models exhibited similarly good discriminative and calibration performance. (A) ROC curves and AUROC scores. (B) PR curves and AUPR scores. (C) Calibration plots and Brier scores. ARF: acute respiratory failure; AUROC: area under the receiver operating characteristics curve; AUPR: area under the precision-recall curve; CI: confidence interval; CNN: convolutional neural networks; FIDDLE: Flexible Data-Driven Pipeline; LR: logistic regression; LSTM: long short-term memory networks; PR: precision-recall curve; RF: random forest; ROC: receiver operating characteristics curve.
