Methods Inf Med. 2022 Jun;61(S 01):e1-e11. doi: 10.1055/s-0041-1740564. Epub 2022 Jan 17.

A Privacy-Preserving Distributed Analytics Platform for Health Care Data

Sascha Welten et al. Methods Inf Med. 2022 Jun.

Abstract

Background: In recent years, data-driven medicine has gained increasing importance for diagnosis, treatment, and research due to the exponential growth of health care data. However, data protection regulations prohibit centralising data for analysis because of potential privacy risks such as the accidental disclosure of data to third parties. Therefore, alternative data usage policies that comply with present privacy guidelines are of particular interest.

Objective: We aim to enable analyses of sensitive patient data while complying with local data protection regulations, using an approach called the Personal Health Train (PHT), a paradigm that utilises distributed analytics (DA) methods. The main principle of the PHT is that the analytical task is brought to the data provider, while the data instances remain in their original location.
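
To make this principle concrete, the following is a minimal, hypothetical sketch (not the authors' implementation) of a task visiting a data provider: the computation runs where the records live, and only the derived state leaves the Station. All names are illustrative.

```python
from dataclasses import dataclass, field
from typing import Callable, Sequence

@dataclass
class Train:
    """A visiting analytical task: it carries code and state, never raw data."""
    task: Callable[[Sequence[float], dict], dict]
    state: dict = field(default_factory=dict)

def execute_at_station(train: Train, local_records: Sequence[float]) -> Train:
    # The task executes on-site; only its updated state (e.g. model
    # parameters or aggregates) travels onwards, never the records.
    train.state = train.task(local_records, train.state)
    return train

# Example task: a running mean that never exposes individual records.
def running_mean(records: Sequence[float], state: dict) -> dict:
    return {"n": state.get("n", 0) + len(records),
            "sum": state.get("sum", 0.0) + sum(records)}

train = Train(task=running_mean)
for provider_data in ([1.0, 2.0], [3.0], [4.0, 5.0]):  # three data providers
    train = execute_at_station(train, provider_data)
print(train.state["sum"] / train.state["n"])           # -> 3.0
```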

Methods: In this work, we present our implementation of the PHT paradigm, which preserves the sovereignty and autonomy of the data providers and operates with a limited number of communication channels. We further conduct a DA use case on data stored at three different, distributed data providers.

Results: We show that our infrastructure enables the training of models on distributed data sources.

Conclusion: Our work demonstrates the capabilities of DA infrastructures in the health care sector, which lower the regulatory obstacles to sharing patient data. We further demonstrate the infrastructure's ability to fuel medical science by making distributed data sets available to scientists and health care practitioners.

Conflict of interest statement

None declared.

Figures

Fig. 1
Personal Health Train (PHT) architecture. The architecture consists of a central managing unit and separate, autonomous Station environments. Each subcomponent is accessible via a browser. In total, there are four communication channels between the Central Service (CS) component and the Station(s): Pull, Push, Queue Request, and Reject. Based on these commands, our Train Configurator keeps the Train Registry (Harbor) and the Central Service database synchronised. In parallel, the architecture includes monitoring components (orange). On the Station side, we crawl the necessary information about local transactions; via a gateway, the metadata/monitoring data are pushed to a global metadata store and visualised with Grafana.
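
As a rough illustration of the four commands named above, a dispatcher on the Central Service side might be sketched as follows. This is a hypothetical reading of Fig. 1; only the command names are taken from the paper.

```python
from enum import Enum, auto

class Command(Enum):
    PULL = auto()           # Station fetches the next Train image
    PUSH = auto()           # Station returns the executed Train
    QUEUE_REQUEST = auto()  # Station asks for the Trains queued for it
    REJECT = auto()         # Station refuses a Train before a pull/push

def handle(cmd: Command, station_id: str) -> str:
    # Hypothetical CS-side handler; per Fig. 1, the real Train Configurator
    # also keeps the Train Registry (Harbor) and the CS database in sync.
    return f"CS handled {cmd.name} from {station_id}"

print(handle(Command.QUEUE_REQUEST, "S1"))
```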
Fig. 2
Train lifecycle diagram. Our workflow has two state types. The first (yellow) states represent the states of a Train Class in the App Store. When a researcher requests a Train, the Train is instantiated and follows the states in the lower part of the figure, which represents the actual Train circulation in our Station network.
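
The two state types could be modelled as two enumerations, roughly as below; the concrete state names are assumptions, since Fig. 2 itself defines the authoritative set.

```python
from enum import Enum

class TrainClassState(Enum):
    # Upper (yellow) part of Fig. 2: the Train Class in the App Store.
    PROPOSED = "proposed"
    APPROVED = "approved"

class TrainInstanceState(Enum):
    # Lower part of Fig. 2: the instantiated Train circulating the network.
    REQUESTED = "requested"
    PULLED = "pulled"
    EXECUTED = "executed"
    PUSHED = "pushed"
    REJECTED = "rejected"
    FINISHED = "finished"
```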
Fig. 3
Swimlane diagram for a Train request. Three parties are involved in a Train request and its execution: the scientist requests the Train, and the Central Service (CS) manages all communication with the Station. The Station has the opportunity to reject the Train before each pull/push operation, either to prevent malicious activity or to prohibit the return of sensitive results.
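
The Station-side veto could look roughly like the following gate, invoked before every pull and push; the specific policy checks are hypothetical.

```python
def station_gate(train_id: str, code_vetted: bool, results_cleared: bool) -> bool:
    """Return True to allow the pull/push, False to reject the Train."""
    if not code_vetted:      # block potentially malicious analytical code
        return False
    if not results_cleared:  # prohibit the return of sensitive results
        return False
    return True

assert station_gate("train-42", code_vetted=True, results_cleared=True)
assert not station_gate("train-42", code_vetted=True, results_cleared=False)
```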
Fig. 4
Overview of the communication between the Central Service (CS) and the Stations. In general, there are three types of repositories. The Train Class repository stores the base images of each Train Class. The User repository is accessible only to the user and stores the latest results, representing the history of the analytical task. Further, each Station has its own repository, which it polls for incoming Trains. The replication procedure is performed automatically by the central unit.
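
The three repository types might be laid out as below; the repository names and the replicate() helper are illustrative, not taken from the paper's Harbor setup.

```python
# Hypothetical registry layout mirroring the three repository types of Fig. 4.
repositories = {
    "train-class": ["pht/classes/pneumo-model:base"],  # base images per Train Class
    "user/alice":  [],  # per-user result history, accessible only to the user
    "station/S1":  [],  # per-Station repository that the Station polls
}

def replicate(image: str, target_repo: str) -> None:
    # In Fig. 4 the central unit performs this replication automatically.
    repositories[target_repo].append(image)

replicate("pht/classes/pneumo-model:base", "station/S1")
```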
Fig. 5
Censorship of intermediate results. For query results, the Station software detects amendments to the executed Train and hides previous query outputs from the admins of subsequent Stations. This feature guarantees that actors can only inspect data they are allowed to review.
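
One way to read this rule in code: each intermediate output carries the set of Stations allowed to review it, and an admin's view is filtered accordingly. The record structure is our assumption.

```python
def visible_outputs(history: list, viewer: str) -> list:
    """Show a Station admin only the outputs they are allowed to review."""
    return [rec["output"] for rec in history if viewer in rec["allowed"]]

history = [
    {"output": "query result from S1", "allowed": {"S1"}},        # hidden later on
    {"output": "query result from S2", "allowed": {"S2", "S3"}},  # still visible
]
print(visible_outputs(history, "S3"))  # -> ['query result from S2']
```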
Fig. 6
Encryption/decryption workflow. In this workflow, we assume the Central Service (CS) to be (semi-)trusted. Each entity owns a private/public key pair. For encryption, we apply envelope encryption of a symmetric key, for example, a session key generated for each Train execution. The encryption concept is designed such that only the dedicated recipient is able to decrypt the digital assets.
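
A minimal envelope-encryption sketch in this spirit, using the Python cryptography package (our choice, not necessarily the authors'): the payload is sealed with a fresh symmetric session key, and the session key is wrapped with the recipient's public key, so only that recipient can open the asset.

```python
from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)
recipient = rsa.generate_private_key(public_exponent=65537, key_size=2048)

# Seal: symmetric encryption of the result, asymmetric wrap of the key.
session_key = Fernet.generate_key()                      # fresh per Train run
ciphertext = Fernet(session_key).encrypt(b"model update after Train run")
wrapped_key = recipient.public_key().encrypt(session_key, oaep)

# Open: only the holder of the matching private key can unwrap the session key.
plaintext = Fernet(recipient.decrypt(wrapped_key, oaep)).decrypt(ciphertext)
assert plaintext == b"model update after Train run"
```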
Fig. 7
Two training runs of the pneumonia detection model (Pneumo-Model) with different model parameter initialisations. We perform a threefold repetition of the Station sequence S1, S2, S3 to train our model, resulting in nine model transmissions/training routines per run. Each line indicates the performance progress of one model on our test set. While training of the blue model fails, our architecture successfully trains the red model, which reaches acceptable performance, as the quality metrics indicate.
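
The circulation behind this experiment reduces to a simple nested loop, sketched below with a placeholder for the local training routine.

```python
def train_locally(params: list, station: str) -> list:
    # Placeholder for one local training pass on the Station's data;
    # in the paper, the Pneumo-Model training routine stands here.
    return [p + 0.1 for p in params]

params = [0.0, 0.0]                      # one model parameter initialisation
for repetition in range(3):              # threefold sequence repetition ...
    for station in ("S1", "S2", "S3"):   # ... over the fixed Station order
        params = train_locally(params, station)
# nine model transmissions/training routines in total
```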
