Int J Epidemiol. 2010 Oct;39(5):1372-82.
doi: 10.1093/ije/dyq111. Epub 2010 Jul 14.

DataSHIELD: resolving a conflict in contemporary bioscience--performing a pooled analysis of individual-level data without sharing the data

Michael Wolfson et al. Int J Epidemiol. 2010 Oct.

Abstract

Background: Contemporary bioscience sometimes demands vast sample sizes and there is often then no choice but to synthesize data across several studies and to undertake an appropriate pooled analysis. This same need is also faced in health-services and socio-economic research. When a pooled analysis is required, analytic efficiency and flexibility are often best served by combining the individual-level data from all sources and analysing them as a single large data set. But ethico-legal constraints, including the wording of consent forms and privacy legislation, often prohibit or discourage the sharing of individual-level data, particularly across national or other jurisdictional boundaries. This leads to a fundamental conflict in competing public goods: individual-level analysis is desirable from a scientific perspective, but is prevented by ethico-legal considerations that are entirely valid.

Methods: Data aggregation through anonymous summary-statistics from harmonized individual-level databases (DataSHIELD) provides a simple approach to analysing pooled data that circumvents this conflict. This is achieved via parallelized analysis and modern distributed computing and, in one key setting, takes advantage of the properties of the updating algorithm for generalized linear models (GLMs).
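The idea of fitting a GLM from summary statistics alone can be illustrated with a minimal sketch. It assumes a logistic regression fitted by Fisher scoring (iteratively reweighted least squares): at each iteration every study computes its own score vector and information matrix for the current coefficients, and the analysis centre only sums these and updates the coefficients. The function names (`local_summaries`, `solve`, `fit`) and the toy data are hypothetical and are not taken from the article; a real deployment would exchange the summaries over a network rather than through a function call.

```python
import math

def local_summaries(X, y, beta):
    """Runs at a study centre (hypothetical helper).
    Returns the score vector and information matrix of a logistic model
    at the current beta. Only these sums leave the site, never the rows."""
    p = len(beta)
    score = [0.0] * p
    info = [[0.0] * p for _ in range(p)]
    for xi, yi in zip(X, y):
        eta = sum(b * x for b, x in zip(beta, xi))
        mu = 1.0 / (1.0 + math.exp(-eta))   # fitted probability
        w = mu * (1.0 - mu)                 # IRLS weight
        for j in range(p):
            score[j] += xi[j] * (yi - mu)
            for k in range(p):
                info[j][k] += w * xi[j] * xi[k]
    return score, info

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def fit(studies, p, iters=25, tol=1e-10):
    """Runs at the analysis centre (hypothetical helper).
    `studies` is a list of (X, y) pairs standing in for remote sites."""
    beta = [0.0] * p
    for _ in range(iters):
        total_score = [0.0] * p
        total_info = [[0.0] * p for _ in range(p)]
        for X, y in studies:
            score, info = local_summaries(X, y, beta)
            for j in range(p):
                total_score[j] += score[j]
                for k in range(p):
                    total_info[j][k] += info[j][k]
        step = solve(total_info, total_score)   # Fisher scoring update
        beta = [b + d for b, d in zip(beta, step)]
        if max(abs(d) for d in step) < tol:
            break
    return beta

# Made-up toy data: each row is [intercept, exposure]; y is a binary outcome.
site_a = ([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]], [0, 0, 1, 0])
site_b = ([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]], [0, 1, 1, 1])

beta_partitioned = fit([site_a, site_b], p=2)
beta_pooled = fit([(site_a[0] + site_b[0], site_a[1] + site_b[1])], p=2)
```

Because the score and information are sums over individuals, adding the per-site summaries gives the same update as analysing one pooled file, so `beta_partitioned` and `beta_pooled` agree up to floating-point rounding in this sketch.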

Results: The conceptual use of DataSHIELD is illustrated in two different settings.

Conclusions: As the study of the aetiological architecture of chronic diseases advances to encompass more complex causal pathways (e.g. to include the joint effects of genes, lifestyle and environment), sample size requirements will increase further and the analysis of pooled individual-level data will become ever more important. An aim of this conceptual article is to encourage others to address the challenges and opportunities that DataSHIELD presents, and to explore potential extensions, for example to its use when different data sources hold different data on the same individuals.

Figures

Figure 1. Schematic representation of the structure of the scientific problems that DataSHIELD is designed to address. (a) One file: all individual-level data pooled together in one large data file. (b) Partitioned: individual-level data held in six separate data files, one for each study.
Figure 2. Schematic representation of the structure of DataSHIELD. The computer controlling analysis (heavily shaded circle) is sited at the analysis centre (MP: master process). The data computers (lightly shaded circles) are each sited at one of the study centres involved in the collaborative analysis (SP: slave process). The arrows indicate the flow of analytic instructions and summary statistics. All potentially disclosive individual-level data are secured on the local data computers.
