Methods Inf Med. 2022 Jun;61(S 01):e1-e11. doi: 10.1055/s-0041-1740564. Epub 2022 Jan 17.

A Privacy-Preserving Distributed Analytics Platform for Health Care Data

Sascha Welten et al. Methods Inf Med. 2022 Jun.

Abstract

Background: In recent years, data-driven medicine has gained increasing importance for diagnosis, treatment, and research due to the exponential growth of health care data. However, data protection regulations prohibit centralising data for analysis because of potential privacy risks such as the accidental disclosure of data to third parties. Therefore, alternative data usage policies that comply with present privacy guidelines are of particular interest.

Objective: We aim to enable analyses of sensitive patient data while complying with local data protection regulations, using an approach called the Personal Health Train (PHT), a paradigm that utilises distributed analytics (DA) methods. The main principle of the PHT is that the analytical task is brought to the data provider, while the data instances remain in their original location.
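
To make this principle concrete, the following is a minimal, hypothetical sketch (not the authors' implementation) of a task visiting a data provider: the computation runs where the records live, and only the derived state leaves the Station. All names are illustrative.

```python
from dataclasses import dataclass, field
from typing import Callable, Sequence

@dataclass
class Train:
    """A visiting analytical task: it carries code and state, never raw data."""
    task: Callable[[Sequence[float], dict], dict]
    state: dict = field(default_factory=dict)

def execute_at_station(train: Train, local_records: Sequence[float]) -> Train:
    # The task executes on-site; only its updated state (e.g. model
    # parameters or aggregates) travels onwards, never the records.
    train.state = train.task(local_records, train.state)
    return train

# Example task: a running mean that never exposes individual records.
def running_mean(records: Sequence[float], state: dict) -> dict:
    return {"n": state.get("n", 0) + len(records),
            "sum": state.get("sum", 0.0) + sum(records)}

train = Train(task=running_mean)
for provider_data in ([1.0, 2.0], [3.0], [4.0, 5.0]):  # three data providers
    train = execute_at_station(train, provider_data)
print(train.state["sum"] / train.state["n"])           # -> 3.0
```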

Methods: In this work, we present our implementation of the PHT paradigm, which preserves the sovereignty and autonomy of the data providers and operates with a limited number of communication channels. We further conduct a DA use case on data stored at three different, distributed data providers.

Results: We show that our infrastructure enables the training of models on distributed data sources.

Conclusion: Our work demonstrates the capabilities of DA infrastructures in the health care sector, which lower the regulatory obstacles to sharing patient data. We further demonstrate the infrastructure's ability to fuel medical science by making distributed data sets available to scientists and health care practitioners.

Conflict of interest statement

None declared.

Figures

Fig. 1
Personal Health Train (PHT) architecture. The architecture consists of a central managing unit and separate, autonomous Station environments. Each subcomponent is accessible via a browser. In total, there are four communication channels between the Central Service (CS) component and the Station(s): Pull, Push, Queue Request, and Reject. Based on these commands, our Train Configurator keeps the Train Registry (Harbor) and the Central Service database synchronised. In parallel, the architecture includes monitoring components (orange). On the Station side, we crawl the necessary information about local transactions; via a gateway, the metadata/monitoring data are pushed to a global metadata store and visualised with Grafana.
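
As a rough illustration of the four commands named above, a dispatcher on the Central Service side might be sketched as follows. This is a hypothetical reading of Fig. 1; only the command names are taken from the paper.

```python
from enum import Enum, auto

class Command(Enum):
    PULL = auto()           # Station fetches the next Train image
    PUSH = auto()           # Station returns the executed Train
    QUEUE_REQUEST = auto()  # Station asks for the Trains queued for it
    REJECT = auto()         # Station refuses a Train before a pull/push

def handle(cmd: Command, station_id: str) -> str:
    # Hypothetical CS-side handler; per Fig. 1, the real Train Configurator
    # also keeps the Train Registry (Harbor) and the CS database in sync.
    return f"CS handled {cmd.name} from {station_id}"

print(handle(Command.QUEUE_REQUEST, "S1"))
```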
Fig. 2
Train lifecycle diagram. Our workflow has two state types. The first (yellow) states represent the states of a Train Class in the App Store. When a researcher requests a Train, the Train is instantiated and follows the states in the lower part of the figure, which represents the actual Train circulation in our Station network.
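
The two state types could be modelled as two enumerations, roughly as below; the concrete state names are assumptions, since Fig. 2 itself defines the authoritative set.

```python
from enum import Enum

class TrainClassState(Enum):
    # Upper (yellow) part of Fig. 2: the Train Class in the App Store.
    PROPOSED = "proposed"
    APPROVED = "approved"

class TrainInstanceState(Enum):
    # Lower part of Fig. 2: the instantiated Train circulating the network.
    REQUESTED = "requested"
    PULLED = "pulled"
    EXECUTED = "executed"
    PUSHED = "pushed"
    REJECTED = "rejected"
    FINISHED = "finished"
```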
Fig. 3
Swimlane diagram for a Train request. Three parties are involved in a Train request and its execution: the scientist requests the Train, and the Central Service (CS) manages all communication with the Station. The Station has the opportunity to reject the Train before each pull/push operation, either to prevent malicious activity or to prohibit the return of sensitive results.
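
The Station-side veto could look roughly like the following gate, invoked before every pull and push; the specific policy checks are hypothetical.

```python
def station_gate(train_id: str, code_vetted: bool, results_cleared: bool) -> bool:
    """Return True to allow the pull/push, False to reject the Train."""
    if not code_vetted:      # block potentially malicious analytical code
        return False
    if not results_cleared:  # prohibit the return of sensitive results
        return False
    return True

assert station_gate("train-42", code_vetted=True, results_cleared=True)
assert not station_gate("train-42", code_vetted=True, results_cleared=False)
```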
Fig. 4
Overview of the communication between the Central Service (CS) and the Stations. In general, there are three types of repositories. The Train Class repository stores the base images of each Train Class. The User repository is accessible only to the user and stores the latest results, representing the history of the analytical task. Further, each Station has its own repository, which it polls for incoming Trains. The replication procedure is performed automatically by the central unit.
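
The three repository types might be laid out as below; the repository names and the replicate() helper are illustrative, not taken from the paper's Harbor setup.

```python
# Hypothetical registry layout mirroring the three repository types of Fig. 4.
repositories = {
    "train-class": ["pht/classes/pneumo-model:base"],  # base images per Train Class
    "user/alice":  [],  # per-user result history, accessible only to the user
    "station/S1":  [],  # per-Station repository that the Station polls
}

def replicate(image: str, target_repo: str) -> None:
    # In Fig. 4 the central unit performs this replication automatically.
    repositories[target_repo].append(image)

replicate("pht/classes/pneumo-model:base", "station/S1")
```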
Fig. 5
Censorship of intermediate results. For query results, the Station software detects amendments to the executed Train and hides previous query outputs from the admins of subsequent Stations. This feature guarantees that actors can only inspect data they are allowed to review.
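
One way to read this rule in code: each intermediate output carries the set of Stations allowed to review it, and an admin's view is filtered accordingly. The record structure is our assumption.

```python
def visible_outputs(history: list, viewer: str) -> list:
    """Show a Station admin only the outputs they are allowed to review."""
    return [rec["output"] for rec in history if viewer in rec["allowed"]]

history = [
    {"output": "query result from S1", "allowed": {"S1"}},        # hidden later on
    {"output": "query result from S2", "allowed": {"S2", "S3"}},  # still visible
]
print(visible_outputs(history, "S3"))  # -> ['query result from S2']
```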
Fig. 6
Encryption/decryption workflow. In this workflow, we assume the Central Service (CS) to be (semi-)trusted. Each entity owns a private/public key pair. For encryption, we apply envelope encryption of a symmetric key, for example, a session key generated for each Train execution. The encryption concept is designed such that only the dedicated recipient is able to decrypt the digital assets.
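
A minimal envelope-encryption sketch in this spirit, using the Python cryptography package (our choice, not necessarily the authors'): the payload is sealed with a fresh symmetric session key, and the session key is wrapped with the recipient's public key, so only that recipient can open the asset.

```python
from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)
recipient = rsa.generate_private_key(public_exponent=65537, key_size=2048)

# Seal: symmetric encryption of the result, asymmetric wrap of the key.
session_key = Fernet.generate_key()                      # fresh per Train run
ciphertext = Fernet(session_key).encrypt(b"model update after Train run")
wrapped_key = recipient.public_key().encrypt(session_key, oaep)

# Open: only the holder of the matching private key can unwrap the session key.
plaintext = Fernet(recipient.decrypt(wrapped_key, oaep)).decrypt(ciphertext)
assert plaintext == b"model update after Train run"
```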
Fig. 7
Two training runs of the pneumonia detection model (Pneumo-Model) with different model parameter initialisations. We perform a threefold repetition of the Station sequence S1, S2, S3 to train our model, resulting in nine model transmissions/training routines per run. Each line indicates the performance progress of one model on our test set. While training of the blue model fails, our architecture successfully trains the red model, which reaches acceptable performance, as the quality metrics indicate.
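
The circulation behind this experiment reduces to a simple nested loop, sketched below with a placeholder for the local training routine.

```python
def train_locally(params: list, station: str) -> list:
    # Placeholder for one local training pass on the Station's data;
    # in the paper, the Pneumo-Model training routine stands here.
    return [p + 0.1 for p in params]

params = [0.0, 0.0]                      # one model parameter initialisation
for repetition in range(3):              # threefold sequence repetition ...
    for station in ("S1", "S2", "S3"):   # ... over the fixed Station order
        params = train_locally(params, station)
# nine model transmissions/training routines in total
```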
