Nat Commun. 2015 Dec 11;6:10162. doi: 10.1038/ncomms10162.

A new tool called DISSECT for analysing large genomic data sets using a Big Data approach

Oriol Canela-Xandri et al.

Abstract

Large-scale genetic and genomic data are increasingly available, and the major bottleneck in their analysis is the lack of sufficiently scalable computational tools. To address this problem in the context of complex trait analysis, we present DISSECT. DISSECT is new, freely available software that exploits the distributed-memory parallel architecture of compute clusters to perform a wide range of genomic and epidemiologic analyses which currently can only be carried out on reduced sample sizes or under restricted conditions. We demonstrate the usefulness of our new tool by addressing the challenge of predicting phenotypes from genotype data in human populations using mixed-linear model analysis. We analyse simulated traits from 470,000 individuals genotyped for 590,004 SNPs in ∼4 h using the combined computational power of 8,400 processor cores. We find that prediction accuracies in excess of 80% of the theoretical maximum could be achieved with large sample sizes.
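As background for the mixed-linear model (MLM) analyses described above, a standard formulation, given here as a minimal sketch with notation assumed rather than quoted from the paper, models the phenotype vector as fixed effects plus a random polygenic effect whose covariance is proportional to the genomic relationship matrix:

    \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{g} + \boldsymbol{\varepsilon},
    \qquad \mathbf{g} \sim N\!\left(\mathbf{0}, \mathbf{G}\sigma_g^2\right),
    \qquad \boldsymbol{\varepsilon} \sim N\!\left(\mathbf{0}, \mathbf{I}\sigma_e^2\right)

For n individuals the genomic relationship matrix G is n × n, so at the 470,000-individual scale analysed here it would occupy roughly 470,000^2 × 8 bytes ≈ 1.8 TB in double precision, far beyond the memory of a single compute node; this is what motivates the distributed approach illustrated in Figure 1.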


Figures

Figure 1. Data distribution among compute nodes.
A single compute node has a small number of compute cores and a limited amount of memory. This limits the dimensions of the matrices that a single node can analyse, which in turn limits the sample sizes that can be used in common genomic analyses. To overcome these memory and computational limitations, DISSECT decomposes the matrices into blocks and distributes them among networked compute nodes following a two-dimensional cyclic distribution. Each node performs computations on local data and exchanges data with other nodes over the network when the algorithm requires it. The root node coordinates the other nodes, and collects and distributes inputs and outputs when required. This approach scales well because it is not restricted by the computational limits of a single compute node.
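For intuition about the layout described in this caption, here is a minimal Python sketch, under illustrative assumptions (grid shape, block size and function names are not DISSECT's), of how a two-dimensional block-cyclic distribution maps matrix blocks onto a grid of compute nodes:

    # Sketch of a 2-D block-cyclic layout: block (bi, bj) of the matrix is stored
    # by process (bi mod p_rows, bj mod p_cols) of the process grid, so consecutive
    # blocks cycle over the nodes and each node holds a roughly equal share.

    def owner_of_block(bi, bj, p_rows, p_cols):
        """Grid coordinates of the process that stores matrix block (bi, bj)."""
        return (bi % p_rows, bj % p_cols)

    def owner_of_element(i, j, block_size, p_rows, p_cols):
        """Map a global matrix element (i, j) to its owning process."""
        return owner_of_block(i // block_size, j // block_size, p_rows, p_cols)

    # Example: with a 4 x 2 process grid and 256 x 256 blocks,
    # element (1000, 3000) falls in block (3, 11) and lives on process (3, 1).
    print(owner_of_element(1000, 3000, block_size=256, p_rows=4, p_cols=2))

Because consecutive blocks cycle over the grid, the work of operations that touch only part of the matrix still spreads across all nodes, which keeps the load balanced.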
Figure 2. Computational requirements for MLM and PCA.
Computational time (blue lines, left axis) and number of processor cores used (red lines, right axis), on a log scale, for (a) MLM and (b) PCA analyses as a function of sample size. Core days is the time in days required to complete an analysis multiplied by the number of cores used. It is a rough estimate of the computational time a single computer with a single core would require to perform the analysis if DISSECT scaled perfectly (that is, with no performance penalty from communication between compute nodes). Labels over the blue dots indicate the real time used for each analysis.
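As a worked example of this definition, using the run reported in the abstract (8,400 cores for ∼4 h):

    \text{core days} \approx 8{,}400 \times \frac{4\ \text{h}}{24\ \text{h/day}} = 1{,}400

so a perfectly scaling single-core machine would need on the order of 1,400 days for the same analysis.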
Figure 3. Prediction accuracy of MLM as a function of sample size and heritability.
Correlation between true (P) and predicted phenotypes as a function of cohort size for a trait determined by 10,000 QTNs. Black, blue and red curves represent heritabilities of 0.2, 0.5 and 0.7, respectively. Horizontal dashed lines indicate the theoretical maximum achievable for each heritability. Error bars are two times the s.d. over six replicates (the 470,000-individual case has only one replicate).
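A note on where these dashed lines sit (a standard quantitative-genetics bound, stated here as an assumption rather than quoted from the paper): if the phenotype is the sum of independent genetic and environmental components, the correlation between the phenotype and even a perfect genetic predictor cannot exceed the square root of the heritability,

    r_{\max} = \sqrt{h^2}, \qquad \sqrt{0.2} \approx 0.45, \quad \sqrt{0.5} \approx 0.71, \quad \sqrt{0.7} \approx 0.84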
Figure 4. Prediction accuracy when all QTNs were genotyped.
Correlation between true (P) and predicted phenotypes as a function of cohort size when the trait is determined by 10,000 QTNs. Black, blue and red curves represent traits with heritabilities of 0.2, 0.5 and 0.7, respectively. Solid lines are the correlations obtained when all QTNs were genotyped; dotted lines are the correlations obtained when only ∼20% of the QTNs were genotyped. Horizontal dashed lines indicate the maximum theoretical correlation for each heritability.

