Nat Commun. 2024 Feb 21;15(1):1582.

doi: 10.1038/s41467-024-45777-z.

Data encoding for healthcare data democratization and information leakage prevention

Anshul Thakur¹, Tingting Zhu^#², Vinayak Abrol^#³, Jacob Armstrong², Yujiang Wang^{4,5}, David A Clifton^{2,6}

Affiliations

¹ Department of Engineering Science, University of Oxford, OX3 7DQ, Oxfordshire, UK. anshul.thakur@eng.ox.ac.uk.
² Department of Engineering Science, University of Oxford, OX3 7DQ, Oxfordshire, UK.
³ Infosys Centre for AI, IIIT Delhi, Delhi, India.
⁴ Department of Engineering Science, University of Oxford, OX3 7DQ, Oxfordshire, UK. yujiang.wang@oscar.ox.ac.uk.
⁵ Oxford Suzhou Centre for Advanced Research, Suzhou, China. yujiang.wang@oscar.ox.ac.uk.
⁶ Oxford Suzhou Centre for Advanced Research, Suzhou, China.

^# Contributed equally.

PMID: 38383571
PMCID: PMC10882022
DOI: 10.1038/s41467-024-45777-z

Anshul Thakur et al. Nat Commun. 2024.

Abstract

The lack of data democratization and information leakage from trained models hinder the development and acceptance of robust deep learning-based healthcare solutions. This paper argues that irreversible data encoding can provide an effective solution to achieve data democratization without violating the privacy constraints imposed on healthcare data and clinical models. An ideal encoding framework transforms the data into a new space where it is imperceptible to manual or computational inspection. However, encoded data should preserve the semantics of the original data such that deep learning models can be trained effectively. This paper hypothesizes the characteristics of the desired encoding framework and then exploits random projections and random quantum encoding to realize this framework for dense and longitudinal or time-series data. Experimental evaluation highlights that models trained on encoded time-series data effectively uphold the information bottleneck principle and hence leak less information.
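The abstract names random projections as one way to realize the encoding. A minimal sketch of the general idea follows, assuming a Gaussian projection matrix drawn from a secret seed. The function name `random_projection_encode`, the `out_dim` and `seed` parameters, and the scaling are illustrative assumptions, not the construction used in the paper.

```python
import numpy as np

def random_projection_encode(signal, out_dim, seed):
    """Project a 1-d signal onto `out_dim` random Gaussian directions.

    Hypothetical helper: the matrix distribution and scaling are assumptions.
    """
    signal = np.asarray(signal, dtype=float)
    rng = np.random.default_rng(seed)
    # The projection matrix is derived from the seed and is never shared.
    # Dividing by sqrt(out_dim) keeps vector norms roughly unchanged on average.
    projection = rng.standard_normal((out_dim, signal.shape[0])) / np.sqrt(out_dim)
    return projection @ signal

# Example: a 48-step signal encoded into 16 values.
raw = np.sin(np.linspace(0.0, 6.0, 48))
encoded = random_projection_encode(raw, out_dim=16, seed=7)
```

When `out_dim` is smaller than the signal length, the map has no exact inverse even if the matrix were known, which is one simple way to read "irreversible" here.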

Conflict of interest statement

The authors declare no competing interests.

Figures

**Fig. 1. A schematic illustration depicting the proposed encoding framework and its various components.**
(a) Conceptual rendition of a multivariate time-series as a collection of multiple 1-d signals. (b) Illustration of the process of encoding one of the 1-d signals within a time-series using the proposed encoding framework. (c) Illustration of a quantum circuit that is composed of four wires, unitary rotation gates, and controlled-NOT (CNOT) gates. (d) Illustration of the setup used for evaluating the latent information leakage from the trained mortality prediction models. Penultimate-layer embeddings from the trained mortality prediction models are given as input to a linear or dense layer that predicts either gender or patient disorders.
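The circuit in panel c (four wires, rotation gates, CNOT gates) can be simulated classically on a state vector. The sketch below is one possible arrangement under stated assumptions: data values enter as RY rotation angles, a ring of CNOTs entangles the wires, fixed random RY rotations follow, and the output is the Pauli-Z expectation on each wire. The gate order, the choice of RY, and the names `quantum_encode` and `thetas` are illustrative and are not taken from the paper.

```python
import numpy as np

N_WIRES = 4

def ry(theta):
    """Single-qubit rotation about the Y axis."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

def apply_single(state, gate, wire):
    """Apply a 2x2 gate to one wire of a state tensor of shape (2,)*N_WIRES."""
    state = np.tensordot(gate, state, axes=([1], [wire]))
    return np.moveaxis(state, 0, wire)

def apply_cnot(state, control, target):
    """Flip the target wire wherever the control wire is 1."""
    state = state.copy()
    idx = [slice(None)] * N_WIRES
    idx[control] = 1
    sub = state[tuple(idx)]
    # Indexing the control axis removes it, so later axes shift down by one.
    t_axis = target if target < control else target - 1
    state[tuple(idx)] = np.flip(sub, axis=t_axis).copy()
    return state

def quantum_encode(window, thetas):
    """Encode four values into four Pauli-Z expectations (assumed layout)."""
    state = np.zeros((2,) * N_WIRES)
    state[(0,) * N_WIRES] = 1.0                      # start in |0000>
    for w in range(N_WIRES):                         # data enters as angles
        state = apply_single(state, ry(window[w]), w)
    for w in range(N_WIRES):                         # ring of CNOTs
        state = apply_cnot(state, w, (w + 1) % N_WIRES)
    for w in range(N_WIRES):                         # fixed random rotations
        state = apply_single(state, ry(thetas[w]), w)
    probs = np.abs(state) ** 2
    out = []
    for w in range(N_WIRES):
        other = tuple(a for a in range(N_WIRES) if a != w)
        marginal = probs.sum(axis=other)             # P(wire w = 0), P(wire w = 1)
        out.append(marginal[0] - marginal[1])
    return np.array(out)

# Example with random (secret) rotation angles.
rng = np.random.default_rng(0)
thetas = rng.uniform(0.0, 2.0 * np.pi, size=N_WIRES)
encoded = quantum_encode([0.1, 0.5, 1.0, 1.5], thetas)
```

Each output lies in [-1, 1]. A longer signal could be handled by sliding a four-sample window over it, though that windowing scheme is likewise an assumption.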

**Fig. 2. Impact of the data encoding on the performance of different deep learning models.**
Performance of LSTM, Vision Transformer (ViT), Transformer, Temporal Convolutional Network (TCN), and Multi-Branch Temporal Convolutional Network (Multi-TCN) on (a) MIMIC-III, (b) PhysioNet, and (c) eICU, respectively, obtained across five different runs. Violin plots illustrate the average performance of all models based on encoding methods for (d) MIMIC-III, (e) PhysioNet, and (f) eICU, respectively. The middle line within each violin plot represents the median, while the lines on either side represent the lower and upper quartiles. Source data are provided as a Source Data file.

**Fig. 3. The extent to which data encoding prevents the leakage of gender information from trained models.**
Gender prediction from the latent embeddings obtained from different models trained on (a) MIMIC-III, (b) PhysioNet, and (c) eICU datasets, respectively. Violin plots illustrate the average performance of all models as a function of the encoding method on (d) MIMIC-III, (e) PhysioNet, and (f) eICU datasets, respectively. Every point on all plots represents the respective model performance obtained during one of the five runs. The middle line within each violin plot represents the median, while the lines on either side represent the lower and upper quartiles. Source data are provided as a Source Data file.
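The leakage test described here, a linear layer predicting a sensitive attribute from frozen penultimate-layer embeddings, can be sketched as a logistic-regression probe. Everything below is illustrative: the embeddings are synthetic stand-ins rather than outputs of any trained model, accuracy is used for brevity and may differ from the metric reported in the paper, and `train_linear_probe` and `probe_accuracy` are hypothetical helper names.

```python
import numpy as np

def train_linear_probe(embeddings, labels, lr=0.1, epochs=500):
    """Fit a logistic-regression probe with plain gradient descent."""
    X = np.asarray(embeddings, dtype=float)
    y = np.asarray(labels, dtype=float)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probabilities
        grad = p - y                              # gradient of the log-loss
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def probe_accuracy(w, b, embeddings, labels):
    """Fraction of samples whose attribute the probe recovers."""
    preds = (np.asarray(embeddings, dtype=float) @ w + b) > 0
    return float(np.mean(preds == np.asarray(labels).astype(bool)))

# Synthetic stand-ins for embeddings (assumption: 200 patients, 8 dimensions).
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
clean = rng.standard_normal((200, 8))            # no attribute signal
leaky = rng.standard_normal((200, 8))
leaky[:, 0] += 3.0 * (2 * labels - 1)            # attribute shifts one dimension

acc_leaky = probe_accuracy(*train_linear_probe(leaky, labels), leaky, labels)
acc_clean = probe_accuracy(*train_linear_probe(clean, labels), clean, labels)
```

A probe that scores well above chance indicates the embeddings carry the attribute. For a real evaluation the probe would be scored on held-out patients rather than on its own training set, as done here only to keep the sketch short.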

**Fig. 4. The extent to which data encoding prevents the leakage of ethnicity information from the trained models.**
Performance in the latent prediction of a patient’s ethnicity as (a) Asian, (b) African-American, (c) Hispanic, or (d) Caucasian from various models trained on the eICU dataset, respectively. Similarly, violin plots illustrate the average performance across all models in predicting a patient’s ethnicity as (e) Asian, (f) African-American, (g) Hispanic, or (h) Caucasian, respectively, based on encoding methods. Every point on all plots represents the respective model performance obtained during one of the five runs. The middle line within each violin plot represents the median, while the lines on either side represent the lower and upper quartiles. Source data are provided as a Source Data file.

**Fig. 5. The extent to which data encoding prevents the leakage of non-targeted patient conditions from trained patient-care models.**
(a) Model-specific and average performance across all models for predicting 25 latent patient disorders using the penultimate embedding generated from models trained on the MIMIC-III dataset. The chronic and acute disorders shown in (b) and (c) are subsets of the 25 different conditions considered in this work. A single model predicts the presence/absence of all 25 disorders. Every point on all plots represents the respective model performance obtained during one of the five runs. The middle line within each violin plot represents the median, while the lines on either side represent the lower and upper quartiles. Source data are provided as a Source Data file.

**Fig. 6. Impact of data encoding on the information bottleneck.**
Kernel density estimation plots depict the estimated mutual information (MI) between embeddings derived from trained LSTM models and the averaged input time-series in (a) MIMIC-III and (c) PhysioNet. Additionally, similar plots show the estimated MI between embeddings from the trained LSTM models and vectorized input time-series in (b) MIMIC-III and (d) PhysioNet. Source data are provided as a Source Data file.

**Fig. 7. Data encoding enhances imperceptibility.**
The difference in average trends and the average magnitude of the original and encoded signals representing (a) cholesterol, (b) blood urea nitrogen, (c) alkaline phosphatase, and (d) alanine transaminase are examined. These signals are computed by averaging 50 time-series representing patients who eventually face mortality in the PhysioNet dataset. The shaded area surrounding the averaged signal represents the standard deviation. Source data are provided as a Source Data file.

**Fig. 8. Consistency in explainability of models trained on raw and the encoded data.**
A comparison of SHAP-based feature importance in LSTM models trained on (a) original, (b) quantum encoded, and (c) randomly projected versions of the PhysioNet dataset. Source data are provided as a Source Data file.

Grants and funding

WT_/Wellcome Trust/United Kingdom
