Hum Brain Mapp. 2022 May;43(7):2289-2310.
doi: 10.1002/hbm.25788. Epub 2022 Mar 4.

Privacy-preserving quality control of neuroimaging datasets in federated environments

Debbrata K Saha et al. Hum Brain Mapp. 2022 May.

Abstract

Privacy concerns for rare disease data, institutional or IRB policies, and limits on local computational resources, storage, or download capabilities are among the reasons that may preclude analyses that pool data at a single site. A growing number of multisite projects and consortia have been formed to operate in federated environments and conduct productive research under constraints of this kind. In this setting, a quality control tool that visualizes decentralized data in its entirety via global aggregation of local computations is especially important, as it would allow the screening of samples that cannot be jointly evaluated otherwise. To address this need, we present two algorithms: decentralized data stochastic neighbor embedding, dSNE, and its differentially private counterpart, DP-dSNE. We leverage publicly available datasets to simultaneously map data samples located at different sites according to their similarities. Even though the data never leave the individual sites, dSNE does not provide any formal privacy guarantees. To overcome this limitation, we rely on differential privacy: a formal mathematical guarantee that protects individuals from being identified as contributors to a dataset. We implement DP-dSNE with AdaCliP, a recently proposed method that adds less noise to the gradients at each iteration. We introduce metrics for measuring the embedding quality and validate our algorithms on these metrics against their centralized counterpart on two toy datasets. Our validation on six multisite neuroimaging datasets shows promising results for the quality control tasks of visualization and outlier detection, highlighting the potential of our private, decentralized visualization approach.
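
The abstract mentions adding noise to gradients at each iteration to obtain differential privacy. The sketch below illustrates the general idea with a plain Gaussian mechanism: the gradient is clipped to a fixed L2 norm and Gaussian noise proportional to that norm is added. This is not AdaCliP and not the authors' implementation; the function names, the fixed clipping threshold `clip_norm`, and the `noise_multiplier` parameter are all assumptions made for illustration.

```python
import math
import random


def clip_to_norm(vec, clip_norm):
    """Rescale vec so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm <= clip_norm:
        return list(vec)
    return [x * clip_norm / norm for x in vec]


def noisy_gradient(vec, clip_norm, noise_multiplier, rng):
    """Clip the gradient, then add independent Gaussian noise per coordinate.

    The noise standard deviation is noise_multiplier * clip_norm, so the
    noise scales with the largest contribution a clipped gradient can make.
    """
    std = noise_multiplier * clip_norm
    return [x + rng.gauss(0.0, std) for x in clip_to_norm(vec, clip_norm)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, for reproducibility of the demo only
    grad = [3.0, 4.0]       # hypothetical gradient for one embedded point
    print(clip_to_norm(grad, 1.0))
    print(noisy_gradient(grad, 1.0, 0.5, rng))
```

Clipping bounds how much any single gradient can influence the update, which is what lets the added noise translate into a formal privacy guarantee. Adaptive schemes such as AdaCliP aim to reduce the amount of noise needed compared with a fixed threshold like the one assumed here.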

Keywords: fMRI; federated neuroimaging; quality control; sMRI.

Conflict of interest statement

The authors declare no potential conflict of interest.

Figures

FIGURE 1
t‐SNE output on the centralized MNIST and COIL‐20 datasets, with outlier‐free convex hull boundaries
FIGURE 2
MNIST experiment: the reference data contain samples of all MNIST digits, in either a small or a large amount. In the boxplots, t‐SNE was computed on pooled data, and SMALL and LARGE denote the smaller and larger reference datasets, respectively. The rows of plots correspond to the best and worst performing runs: each experiment was run 10 times with different random seeds, and the best and worst of the 10 results are labeled “best” and “worst.” The left plots show clusters labeled by digit, whereas the right plots show clusters labeled by site
FIGURE 3
COIL‐20 Experiment 1: the reference data contain samples of all COIL‐20 objects, in either a small or a large amount. In the boxplots, t‐SNE was computed on pooled data, and SMALL and LARGE denote the smaller and larger reference datasets, respectively. The rows of plots correspond to the best and worst performing runs. The left plots show clusters labeled by object, whereas the right plots show clusters labeled by site. The COIL‐20 dataset consists of 20 different objects, which are shown in the figure. In the layout, each point represents a sample of one of these 20 objects
FIGURE 4
COIL‐20 Experiment 2: the reference dataset is missing one unique COIL‐20 object that is present at one of the local sites. In the boxplots, t‐SNE was computed on pooled data, and SMALL and LARGE denote the smaller and larger reference datasets, respectively. The rows of plots correspond to the best and worst performing runs. The left plots show clusters labeled by object, whereas the right plots show clusters labeled by site
FIGURE 5
Experiment for the QC metrics of the ABIDE dataset. (a) t‐SNE layout of pooled data; (b–d) dSNE layouts for the three different experiments. In each dSNE experiment, 10 local sites and one coordinator site participate in the computation. As with t‐SNE, the decentralized setup yields 10 clusters, with each site marked by a unique color
FIGURE 6
Experiment for the QC metrics of the sMRI dataset. (a) t‐SNE layout of pooled data; (b–d) dSNE layouts for the three different experiments. In all of the experiments, there are four classes, each corresponding to an age group and marked by a unique color. The sMRI dataset consists of brain scans of people from different age groups; one of the brain scans is shown in the figure. These scans are preprocessed before being passed to the dSNE algorithm. In the layout, each point represents a single individual
FIGURE 7
Experiment for the QC metrics of the PING dataset. (a) t‐SNE layout of the pooled data. In all of the columns, the top figure presents the best performing run, while the lower one presents the worst performing run. (b–e) dSNE layouts for four different dSNE experiments. In all of the dSNE experiments, we get four clusters, just as in the t‐SNE case. The PING dataset consists of brain imaging data of children and adolescents; one of the scans is shown in the figure. The scan data are preprocessed before being given as input to our algorithm. In the layout, each point represents a single individual
FIGURE 8
Experiment for the QC metrics of the fBIRN and BSNIP datasets. (a) Top and bottom plots represent the t‐SNE layouts of fBIRN and BSNIP, respectively. (b) Top and bottom plots represent t‐SNE layouts of the combined (fBIRN + BSNIP) datasets, colored by group and by site, respectively. (c) Top and bottom plots represent dSNE layouts of the combined (fBIRN + BSNIP) datasets, colored by group and by site, respectively. The fBIRN and BSNIP datasets contain brain imaging data of healthy controls and individuals with schizophrenia. Intrinsic connectivity networks (ICNs) were extracted from these data and used as input to our algorithm. One of the spatial maps of the ICNs is shown in the figure
FIGURE 9
Experiment for the QC metrics of the MRN fMRI dataset. (a, b) Layouts colored by scanner for t‐SNE and dSNE, respectively. In both experiments, we get four distinct clusters. Here, we can identify poor quality scan samples, marked by the red cluster. In this experiment, three local sites and one remote site participated in the computation. In the layout, each point represents a brain scan of an individual
FIGURE 10
Experiment for DP‐dSNE on the MNIST and PING datasets with σ² = 0.001. (a–c) The t‐SNE, dSNE, and DP‐dSNE layouts for the MNIST dataset; (d–f) the t‐SNE, dSNE, and DP‐dSNE layouts for the PING dataset, respectively. We observe that DP‐dSNE gives results that are close overall to those of dSNE and centralized t‐SNE. In the MNIST layout, each class is marked by a unique color, and in the PING layout, each site is marked by a unique color
FIGURE 11
Plot of the number of iterations (J) versus the total ϵ, given δ = 10⁻⁵ and σ² = 0.001. The RDP and moments accountants give smaller values of ϵ than the strong composition method
FIGURE 12
A run‐time demo of the dSNE algorithm in the COINSTAC simulator. (a–c) The computation phase of dSNE at the beginning, middle, and end of the simulation
FIGURE 13
Experiment for outlier detection on the MRN fMRI dataset. In this experiment, the shared sample contains only the bad scans. In both t‐SNE and dSNE, we can successfully identify poor quality scans, which are marked in red. In this experiment, three local sites and one remote site participated in the computation. In the layout, each point represents a brain scan of an individual
FIGURE 14
Single‐shot dSNE layout of MNIST data (Saha et al., 2017). Single‐shot dSNE was run for Experiments 1, 3, and 4 of the MNIST dataset. For all experiments, we are able to embed and group the same digits from different sites without passing any site information to other sites. Every digit is marked by a unique color. Centralized is the original t‐SNE solution for locally grouped data. Digits are correctly grouped into clusters, but these clusters tend to overlap heavily
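
The Figure 11 caption compares how the total ϵ grows with the number of iterations under different accounting methods. The sketch below shows two standard textbook bounds for comparison: a Rényi differential privacy (RDP) accountant for repeated Gaussian mechanisms, and the advanced ("strong") composition theorem. It is an illustration under stated assumptions, not the paper's accounting code: it assumes sensitivity 1, no subsampling, and a hypothetical `noise_multiplier` that does not correspond to the σ² value in the caption. All function and parameter names are invented for this example.

```python
import math


def rdp_epsilon(steps, noise_multiplier, delta, orders=None):
    """Total epsilon for `steps` Gaussian mechanisms via Renyi DP.

    Each step has RDP alpha / (2 * sigma^2) at order alpha; RDP composes
    additively, and the result is converted to (epsilon, delta)-DP by
    adding log(1 / delta) / (alpha - 1). The best order is chosen on a grid.
    """
    if orders is None:
        orders = [1.0 + k / 10.0 for k in range(1, 1000)]
    best = math.inf
    for alpha in orders:
        rdp = steps * alpha / (2.0 * noise_multiplier ** 2)
        eps = rdp + math.log(1.0 / delta) / (alpha - 1.0)
        best = min(best, eps)
    return best


def advanced_composition(eps_step, delta_step, steps, delta_slack):
    """Total (epsilon, delta) from the advanced composition theorem.

    Composes `steps` mechanisms that are each (eps_step, delta_step)-DP,
    spending an extra delta_slack on the composition itself.
    """
    eps = (eps_step * math.sqrt(2.0 * steps * math.log(1.0 / delta_slack))
           + steps * eps_step * (math.exp(eps_step) - 1.0))
    delta = steps * delta_step + delta_slack
    return eps, delta


if __name__ == "__main__":
    for steps in (10, 100, 1000):
        print(steps, rdp_epsilon(steps, noise_multiplier=4.0, delta=1e-5))
    print(advanced_composition(0.1, 1e-6, steps=100, delta_slack=1e-5))
```

Under these assumptions the RDP bound grows roughly with the square root of the number of steps plus a linear term, which is why tighter accountants can report a smaller total ϵ for the same noise level.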
