Int J Epidemiol. 2014 Dec;43(6):1929-44.
doi: 10.1093/ije/dyu188. Epub 2014 Sep 26.

DataSHIELD: taking the analysis to the data, not the data to the analysis

Amadou Gaye et al. Int J Epidemiol. 2014 Dec.

Abstract

Background: Research in modern biomedicine and social science requires sample sizes so large that they can often only be achieved through a pooled co-analysis of data from several studies. But the pooling of information from individuals in a central database that may be queried by researchers raises important ethico-legal questions and can be controversial. In the UK this has been highlighted by recent debate and controversy relating to the UK's proposed 'care.data' initiative, and these issues reflect important societal and professional concerns about privacy, confidentiality and intellectual property. DataSHIELD provides a novel technological solution that can circumvent some of the most basic challenges in facilitating the access of researchers and other healthcare professionals to individual-level data.

Methods: Commands are sent from a central analysis computer (AC) to several data computers (DCs) storing the data to be co-analysed. The data sets are analysed simultaneously but in parallel. The separate parallelized analyses are linked by non-disclosive summary statistics and commands transmitted back and forth between the DCs and the AC. This paper describes the technical implementation of DataSHIELD using a modified R statistical environment linked to an Opal database deployed behind the computer firewall of each DC. Analysis is controlled through a standard R environment at the AC.
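The exchange described in the Methods can be illustrated with a minimal sketch. It is not the Opal/R implementation from the paper: the `DataComputer` class, the `summary` and `pooled_mean` functions, the `min_count` threshold and the example values are all hypothetical, chosen only to show how a pooled mean can be obtained when each DC returns a sum and a count rather than individual-level records.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DataComputer:
    """Hypothetical stand-in for one DC; individual values stay inside it."""
    name: str
    _values: List[float]
    min_count: int = 5  # illustrative disclosure threshold (an assumption)

    def summary(self) -> Tuple[float, int]:
        """Return only non-disclosive aggregates: (sum, number of records)."""
        n = len(self._values)
        if n < self.min_count:
            raise ValueError(f"{self.name}: too few records to release a summary")
        return sum(self._values), n


def pooled_mean(dcs: List[DataComputer]) -> float:
    """Runs on the AC: combine per-DC summaries into one pooled mean."""
    total = 0.0
    count = 0
    for dc in dcs:
        local_sum, local_n = dc.summary()  # only aggregates cross the firewall
        total += local_sum
        count += local_n
    return total / count


# Made-up example data for two studies.
dcs = [
    DataComputer("study_a", [1.0, 1.2, 1.4, 1.6, 1.8]),
    DataComputer("study_b", [2.0, 2.2, 2.4, 2.6, 2.8, 3.0]),
]
print(round(pooled_mean(dcs), 6))
```

The pooled mean equals the mean that would be computed on the combined data, although the AC never sees a single record. The threshold check is a simplified placeholder for a disclosure control and does not reproduce DataSHIELD's actual rules.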

Results: Based on this Opal/R implementation, DataSHIELD is currently used by the Healthy Obese Project and the Environmental Core Project (BioSHaRE-EU) for the federated analysis of 10 data sets across eight European countries, and this illustrates the opportunities and challenges presented by the DataSHIELD approach.

Conclusions: DataSHIELD facilitates important research in settings where: (i) a co-analysis of individual-level data from several studies is scientifically necessary but governance restrictions prohibit the release or sharing of some of the required data, and/or render data access unacceptably slow; (ii) a research group (e.g. in a developing nation) is particularly vulnerable to loss of intellectual property: the researchers want to fully share the information held in their data with national and international collaborators, but do not wish to hand over the physical data themselves; and (iii) a data set is to be included in an individual-level co-analysis but the physical size of the data precludes direct transfer to a new site for analysis.

Keywords: DataSHIELD; ELSI; bioinformatics; confidentiality; disclosure; distributed computing; intellectual property; pooled analysis; privacy.

Figures

Figure 1. Typical DataSHIELD setting for a pooled individual-level analysis.
Figure 2. Overview of the IT infrastructure required for a DataSHIELD process. The settings are the same in all DCs so only one is highlighted in this figure.
Figure 3. Graphical view of pooled data (a), horizontally partitioned (b) and vertically partitioned data (c).
Figure 4. Overview of a DataSHIELD process. Each of the 8 steps and the terms used to refer to the key components and data exchanged between AC and DCs are detailed in Table 1.
Figure 5. For the Healthy Obese Project, communications between AC and DCs were channelled through a trusted portal.
Figure 6. Histogram plots of the variable 'LAB_HDL' for each study (A) and for the pooled data (B).
Figure 7. Illustration of DataSHIELD set-up for the analyses of: (a) horizontally partitioned data (similar data, different individuals) held in GP databases and/or data centres. (**Single-site DataSHIELD); and (b) vertically partitioned data requiring record linkage between different types of data on the same individuals held in a variety of data archives.
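
As an illustration of how a pooled histogram such as the one in Figure 6 could be built from horizontally partitioned data, the following sketch has each study count its own values on bin edges agreed in advance, and the analysis side simply adds the counts. The function names, bin edges and values are hypothetical and are not taken from the paper or from the DataSHIELD code base; the disclosure checks that a real deployment would apply are omitted.

```python
from typing import List, Sequence


def local_histogram(values: Sequence[float], edges: Sequence[float]) -> List[int]:
    """Runs inside one study: count values per bin, return counts only.

    Bins are half-open [lo, hi), except the last bin, which also includes
    its upper edge. Values outside the edges are ignored.
    """
    counts = [0] * (len(edges) - 1)
    last = len(counts) - 1
    for v in values:
        for i in range(len(counts)):
            in_bin = edges[i] <= v < edges[i + 1]
            on_top_edge = i == last and v == edges[-1]
            if in_bin or on_top_edge:
                counts[i] += 1
                break
    return counts


def pooled_histogram(per_study_counts: Sequence[Sequence[int]]) -> List[int]:
    """Runs on the analysis side: add up the per-study bin counts."""
    return [sum(bin_counts) for bin_counts in zip(*per_study_counts)]


# Made-up values; the same bin edges are sent to every study.
edges = [0.0, 1.0, 2.0, 3.0]
study_a = [0.5, 1.2, 1.7, 2.9]
study_b = [0.1, 0.9, 1.5, 3.0]

counts_a = local_histogram(study_a, edges)
counts_b = local_histogram(study_b, edges)
pooled = pooled_histogram([counts_a, counts_b])
print(counts_a, counts_b, pooled)
```

Because every study uses the same edges, the summed counts match the histogram that would be obtained from the physically pooled values.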

Publication types