BMC Med Inform Decis Mak. 2020 Oct 29;20(1):280.

doi: 10.1186/s12911-020-01297-6.

Combining structured and unstructured data for predictive models: a deep learning approach

Dongdong Zhang¹,², Changchang Yin³, Jucheng Zeng¹,², Xiaohui Yuan², Ping Zhang⁴,⁵

Affiliations

¹ Department of Biomedical Informatics, The Ohio State University, 1800 Cannon Drive, Columbus, OH, 43210, USA.
² School of Computer Science and Technology, Wuhan University of Technology, Wuhan, 430070, Hubei, China.
³ Department of Computer Science and Engineering, The Ohio State University, 2015 Neil Ave, Columbus, OH, 43210, USA.
⁴ Department of Biomedical Informatics, The Ohio State University, 1800 Cannon Drive, Columbus, OH, 43210, USA. mail.pingzhang@gmail.com.
⁵ Department of Computer Science and Engineering, The Ohio State University, 2015 Neil Ave, Columbus, OH, 43210, USA. mail.pingzhang@gmail.com.

PMID: 33121479
PMCID: PMC7596962
DOI: 10.1186/s12911-020-01297-6

Dongdong Zhang et al. BMC Med Inform Decis Mak. 2020.

Abstract

Background: The broad adoption of electronic health records (EHRs) provides great opportunities to conduct health care research and solve various clinical problems in medicine. With recent advances and success, methods based on machine learning and deep learning have become increasingly popular in medical informatics. However, while many research studies use temporal structured data for predictive modeling, they typically neglect potentially valuable information in unstructured clinical notes. Integrating heterogeneous data types across EHRs through deep learning techniques may help improve the performance of prediction models.

Methods: We propose two general-purpose multi-modal neural network architectures that enhance patient representation learning by combining sequential unstructured notes with structured data. The proposed fusion models use document embeddings to represent long clinical notes, either convolutional neural networks (CNNs) or long short-term memory (LSTM) networks to model the sequential clinical notes and temporal signals, and one-hot encoding to represent static information. The concatenation of these representations forms the final patient representation, which is used to make predictions.
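The fusion step described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation: max-pooling over time stands in for the learned CNN or LSTM encoders, and the function names, the static field names (`gender`, `admission_type`), and the category vocabularies are hypothetical choices made only for illustration.

```python
from typing import Dict, List, Sequence


def one_hot(value: str, vocabulary: Sequence[str]) -> List[float]:
    """Return a one-hot vector for `value` over a fixed category vocabulary."""
    if value not in vocabulary:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if value == item else 0.0 for item in vocabulary]


def max_pool_over_time(sequence: Sequence[Sequence[float]]) -> List[float]:
    """Collapse a (time x features) sequence to one vector by per-feature max.

    Assumption: this pooling is a stand-in for a learned sequence encoder.
    """
    if not sequence:
        raise ValueError("sequence must contain at least one time step")
    return [max(column) for column in zip(*sequence)]


def fuse_patient(
    note_embeddings: Sequence[Sequence[float]],
    temporal_signals: Sequence[Sequence[float]],
    static_fields: Dict[str, str],
    vocabularies: Dict[str, Sequence[str]],
) -> List[float]:
    """Concatenate the three modality representations into one patient vector."""
    notes_repr = max_pool_over_time(note_embeddings)
    signals_repr = max_pool_over_time(temporal_signals)
    static_repr: List[float] = []
    # Hypothetical static fields; each is one-hot encoded and appended in order.
    for field, vocabulary in vocabularies.items():
        static_repr.extend(one_hot(static_fields[field], vocabulary))
    return notes_repr + signals_repr + static_repr


if __name__ == "__main__":
    patient = fuse_patient(
        note_embeddings=[[0.1, 0.9], [0.4, 0.2]],
        temporal_signals=[[80.0, 36.5], [95.0, 37.2], [90.0, 38.0]],
        static_fields={"gender": "F", "admission_type": "EMERGENCY"},
        vocabularies={
            "gender": ["F", "M"],
            "admission_type": ["ELECTIVE", "EMERGENCY", "URGENT"],
        },
    )
    print(patient)
```

In this sketch the final vector is simply the note representation, followed by the temporal-signal representation, followed by the one-hot static vector; a prediction layer would then take that concatenated vector as input.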

Results: We evaluate the performance of the proposed models on three risk prediction tasks (in-hospital mortality, 30-day hospital readmission, and long length of stay) using data derived from the publicly available Medical Information Mart for Intensive Care III (MIMIC-III) dataset. Our results show that by combining unstructured clinical notes with structured data, the proposed models outperform models that use either unstructured notes or structured data alone.

Conclusions: The proposed fusion models learn better patient representations by combining structured and unstructured data. Integrating heterogeneous data types across EHRs helps improve the performance of prediction models and reduce errors.

Keywords: Data fusion; Deep learning; Electronic health records; Time series forecasting.

Conflict of interest statement

PZ is a member of the editorial board of BMC Medical Informatics and Decision Making. The authors declare that they have no other competing interests.

Figures

**Fig. 1**
Architecture of the CNN-based Fusion-CNN. Fusion-CNN uses document embeddings, a 2-layer CNN, and max-pooling to model sequential clinical notes. Similarly, a 2-layer CNN and max-pooling are used to model temporal signals. The final patient representation is the concatenation of the latent representations of the sequential clinical notes and temporal signals with the static information vector. This representation is then passed to the output layers to make predictions.
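The "2-layer CNN and max-pooling" encoder named in this caption can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's configuration: the kernel widths, the ReLU activation, the absence of bias terms, and all weights below are hypothetical, and no training is performed.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def conv1d_relu(sequence: Matrix, kernels: Sequence[Matrix]) -> List[List[float]]:
    """Valid 1-D convolution over the time axis, followed by ReLU.

    sequence: T x D input; each kernel: K x D weights (same K for all kernels).
    Returns a (T - K + 1) x len(kernels) feature map.
    """
    width = len(kernels[0])
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the kernel width")
    output: List[List[float]] = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        row = []
        for kernel in kernels:
            total = sum(
                weight * value
                for kernel_row, input_row in zip(kernel, window)
                for weight, value in zip(kernel_row, input_row)
            )
            row.append(max(0.0, total))  # ReLU (an assumed activation)
        output.append(row)
    return output


def max_pool_over_time(feature_map: Matrix) -> List[float]:
    """Keep the maximum activation of each channel across all time steps."""
    return [max(channel) for channel in zip(*feature_map)]


def encode_sequence(sequence: Matrix, layer1: Sequence[Matrix],
                    layer2: Sequence[Matrix]) -> List[float]:
    """Two stacked convolution layers, then max-pooling over time."""
    hidden = conv1d_relu(sequence, layer1)
    hidden = conv1d_relu(hidden, layer2)
    return max_pool_over_time(hidden)


if __name__ == "__main__":
    # Four time steps with two features each; all weights are made up.
    signals = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
    layer1 = [[[1.0, 0.0], [0.0, 1.0]], [[-1.0, 0.0], [0.0, 0.0]]]
    layer2 = [[[1.0, 1.0], [1.0, 1.0]]]
    print(encode_sequence(signals, layer1, layer2))
```

The output length equals the number of kernels in the second layer, so the encoded vector has a fixed size regardless of how many time steps the input contains, which is what allows it to be concatenated with the other representations.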

**Fig. 2**
Architecture of the LSTM-based Fusion-LSTM. Fusion-LSTM uses document embeddings, a BiLSTM layer, and a max-pooling layer to model sequential clinical notes. Two-layer LSTMs are used to model temporal signals. The concatenated patient representation is passed to the output layers to make predictions.

**Fig. 3**
Comparison of model running time with different inputs

