JMIR Med Inform. 2022 Oct 21;10(10):e38557.
doi: 10.2196/38557.

Lifting Hospital Electronic Health Record Data Treasures: Challenges and Opportunities

Alexander Maletzky et al. JMIR Med Inform.

Abstract

Electronic health records (EHRs) have been successfully used in data science and machine learning projects. However, most of these data are collected for clinical use rather than for retrospective analysis. This means that researchers typically face many different issues when attempting to access and prepare the data for secondary use. We aimed to investigate how raw EHRs can be accessed and prepared in retrospective data science projects in a disciplined, effective, and efficient way. We report our experience and findings from a large-scale data science project analyzing routinely acquired retrospective data from the Kepler University Hospital in Linz, Austria. The project involved data collection from more than 150,000 patients over a period of 10 years. It included diverse data modalities, such as static demographic data, irregularly acquired laboratory test results, regularly sampled vital signs, and high-frequency physiological waveform signals. Raw medical data can be corrupted in many unexpected ways that demand thorough manual inspection and highly individualized data cleaning solutions. We present a general data preparation workflow, which was shaped in the course of our project and consists of the following 7 steps: obtain a rough overview of the available EHR data, define clinically meaningful labels for supervised learning, extract relevant data from the hospital's data warehouses, match data extracted from different sources, deidentify them, detect errors and inconsistencies therein through a careful exploratory analysis, and implement a suitable data processing pipeline in actual code. Only a few of the data preparation issues encountered in our project were addressed by the generic medical data preprocessing tools that have been proposed recently. Instead, highly individualized solutions for the specific data used in one's own research seem inevitable. We believe that the proposed workflow can serve as guidance for practitioners, helping them to identify and address potential problems early and avoid some common pitfalls.
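
As a rough illustration of the "match" and "deidentify" steps named in the abstract, the following Python sketch joins records from two sources on a patient identifier and then replaces the identifier with a keyed pseudonym and shifts timestamps by a per-patient offset. Everything here is an assumption made for illustration: the field names, the sample records, the secret key, and the choice of HMAC pseudonyms with date shifting are not taken from the paper, which does not describe its deidentification method in the abstract.

```python
import hashlib
import hmac
from datetime import datetime, timedelta

# Assumption: a project secret stored separately from the exported data.
SECRET_KEY = b"hypothetical-project-secret"


def keyed_digest(purpose: str, patient_id: str) -> bytes:
    """HMAC-SHA256 of the patient ID, separated by purpose."""
    message = f"{purpose}:{patient_id}".encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).digest()


def pseudonym(patient_id: str) -> str:
    """Stable pseudonym, so records from different sources still match."""
    return keyed_digest("pseudonym", patient_id).hex()[:16]


def shift_days(patient_id: str, max_days: int = 365) -> timedelta:
    """Per-patient offset of 1..max_days days; intervals stay intact."""
    value = int.from_bytes(keyed_digest("shift", patient_id)[:4], "big")
    return timedelta(days=value % max_days + 1)


def match_and_deidentify(demographics, lab_results):
    """Match lab results to demographics by patient ID, then deidentify."""
    by_patient = {row["patient_id"]: row for row in demographics}
    matched = []
    for lab in lab_results:
        person = by_patient.get(lab["patient_id"])
        if person is None:
            continue  # unmatched records would need manual inspection
        matched.append({
            "pseudonym": pseudonym(lab["patient_id"]),
            "sex": person["sex"],
            "test": lab["test"],
            "value": lab["value"],
            "time": lab["time"] + shift_days(lab["patient_id"]),
        })
    return matched


# Hypothetical sample data from two sources.
demographics = [{"patient_id": "P001", "sex": "F"}]
lab_results = [
    {"patient_id": "P001", "test": "lactate", "value": 1.8,
     "time": datetime(2020, 1, 1, 8, 0)},
    {"patient_id": "P001", "test": "lactate", "value": 2.4,
     "time": datetime(2020, 1, 1, 14, 0)},
    {"patient_id": "P999", "test": "lactate", "value": 3.1,
     "time": datetime(2020, 1, 2, 9, 0)},
]
records = match_and_deidentify(demographics, lab_results)
```

Matching before deidentifying keeps the join on the original identifier simple; deriving the pseudonym and the offset from the same keyed function means separately exported files could still be linked afterwards, provided the key is the same.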

Keywords: electronic health record; machine learning; medical data preparation; retrospective data analysis.

Conflict of interest statement

Conflicts of Interest: None declared.

Figures

Figure 1
Primary challenges with retrospective medical data analysis (adapted from Johnson et al [18], which is published under Creative Commons Attribution 4.0 International License CC-BY 4.0 [19]).
Figure 2
Data sources and exported modalities in use cases 1 to 5. HIS, PDMS, and Bedmaster are data management systems deployed in the hospital, whereas information about extramural mortality and blood products had to be obtained from external sources. HIS: hospital information system; PDMS: patient data management system; ICU: intensive care unit.
Figure 3
Short periods of constant low values in waveform signals might have to be cut out. Left: original signal with a 0.5-second period of constant low values. Right: signal after cutting out the low value; as can be seen, the 2 ends of the signal fit perfectly.
Figure 4
Data preparation workflow for retrospective EHR data analysis. EHR: electronic health record.
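
The cleaning step described in the Figure 3 caption (cutting out short periods of constant low values from a waveform signal) might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the threshold, the minimum run length, the 10 Hz sampling rate, and the sample values are all hypothetical.

```python
from typing import List, Tuple


def find_constant_low_runs(
    signal: List[float],
    low_threshold: float,
    min_run_length: int,
) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, of runs where the
    signal holds one constant value at or below low_threshold for at
    least min_run_length samples."""
    runs = []
    start = 0
    n = len(signal)
    while start < n:
        end = start + 1
        # Extend the run while the value stays exactly the same.
        while end < n and signal[end] == signal[start]:
            end += 1
        if signal[start] <= low_threshold and end - start >= min_run_length:
            runs.append((start, end))
        start = end
    return runs


def cut_out_runs(signal: List[float], runs: List[Tuple[int, int]]) -> List[float]:
    """Remove the given runs and join the remaining pieces."""
    cleaned = []
    previous_end = 0
    for start, end in runs:
        cleaned.extend(signal[previous_end:start])
        previous_end = end
    cleaned.extend(signal[previous_end:])
    return cleaned


# Hypothetical signal sampled at 10 Hz, so 0.5 seconds equals 5 samples.
signal = [80.0, 82.0, 85.0, 0.0, 0.0, 0.0, 0.0, 0.0, 86.0, 84.0]
runs = find_constant_low_runs(signal, low_threshold=5.0, min_run_length=5)
cleaned = cut_out_runs(signal, runs)
print(runs)     # [(3, 8)]
print(cleaned)  # [80.0, 82.0, 85.0, 86.0, 84.0]
```

The sketch only removes the flagged samples. Whether the two ends of the remaining signal actually fit together, as they do in Figure 3, would have to be checked separately, for example by comparing the values on either side of each cut.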

References

    1. Rajkomar A, Oren E, Chen K, Dai AM, Hajaj N, Hardt M, Liu PJ, Liu X, Marcus J, Sun M, Sundberg P, Yee H, Zhang K, Zhang Y, Flores G, Duggan GE, Irvine J, Le Q, Litsch K, Mossin A, Tansuwan J, Wang D, Wexler J, Wilson J, Ludwig D, Volchenboum SL, Chou K, Pearson M, Madabushi S, Shah NH, Butte AJ, Howell MD, Cui C, Corrado GS, Dean J. Scalable and accurate deep learning with electronic health records. NPJ Digit Med. 2018 May 8;1:18. doi: 10.1038/s41746-018-0029-1.
    2. Purushotham S, Meng C, Che Z, Liu Y. Benchmarking deep learning models on large healthcare datasets. J Biomed Inform. 2018 Jul;83:112–34. doi: 10.1016/j.jbi.2018.04.007. https://linkinghub.elsevier.com/retrieve/pii/S1532-0464(18)30071-6
    3. Harutyunyan H, Khachatrian H, Kale DC, Ver Steeg G, Galstyan A. Multitask learning and benchmarking with clinical time series data. Sci Data. 2019 Jun 17;6(1):96. doi: 10.1038/s41597-019-0103-9.
    4. Caicedo-Torres W, Gutierrez J. ISeeU: visually interpretable deep learning for mortality prediction inside the ICU. J Biomed Inform. 2019 Oct;98:103269. doi: 10.1016/j.jbi.2019.103269. https://linkinghub.elsevier.com/retrieve/pii/S1532-0464(19)30188-1
    5. Hatib F, Jian Z, Buddi S, Lee C, Settels J, Sibert K, Rinehart J, Cannesson M. Machine-learning algorithm to predict hypotension based on high-fidelity arterial pressure waveform analysis. Anesthesiology. 2018 Oct;129(4):663–74. doi: 10.1097/ALN.0000000000002300. https://pubs.asahq.org/anesthesiology/article-lookup/doi/10.1097/ALN.000...