Nat Commun. 2015 Dec 11;6:10162. doi: 10.1038/ncomms10162.

A new tool called DISSECT for analysing large genomic data sets using a Big Data approach

Oriol Canela-Xandri et al.

Abstract

Large-scale genetic and genomic data are increasingly available, and the major bottleneck in their analysis is the lack of sufficiently scalable computational tools. To address this problem in the context of complex trait analysis, we present DISSECT. DISSECT is new, freely available software that exploits the distributed-memory parallel architecture of compute clusters to perform a wide range of genomic and epidemiologic analyses which currently can only be carried out on reduced sample sizes or under restricted conditions. We demonstrate the usefulness of our new tool by addressing the challenge of predicting phenotypes from genotype data in human populations using mixed-linear model analysis. We analyse simulated traits from 470,000 individuals genotyped for 590,004 SNPs in ∼4 h using the combined computational power of 8,400 processor cores. We find that prediction accuracies in excess of 80% of the theoretical maximum could be achieved with large sample sizes.
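As background for the mixed-linear model (MLM) analyses described above, a standard formulation, given here as a minimal sketch with notation assumed rather than quoted from the paper, models the phenotype vector as fixed effects plus a random polygenic effect whose covariance is proportional to the genomic relationship matrix:

    \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{g} + \boldsymbol{\varepsilon},
    \qquad \mathbf{g} \sim N\!\left(\mathbf{0}, \mathbf{G}\sigma_g^2\right),
    \qquad \boldsymbol{\varepsilon} \sim N\!\left(\mathbf{0}, \mathbf{I}\sigma_e^2\right)

For n individuals the genomic relationship matrix G is n × n, so at the 470,000-individual scale analysed here it would occupy roughly 470,000^2 × 8 bytes ≈ 1.8 TB in double precision, far beyond the memory of a single compute node; this is what motivates the distributed approach illustrated in Figure 1.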


Figures

Figure 1. Data distribution among compute nodes.
A single compute node has a small number of compute cores and a limited amount of memory. This limits the dimensions of the matrices that a single node can analyse, which in turn limits the sample sizes that can be used in common genomic analyses. To overcome these memory and computational limitations, DISSECT decomposes the matrices into blocks and distributes them among networked compute nodes following a two-dimensional cyclic distribution. Each node performs computations on local data and exchanges data with other nodes over the network when the algorithm requires it. The root node coordinates the other nodes, and collects and distributes inputs and outputs when required. This approach scales well because it is not restricted by the computational limits of a single compute node.
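For intuition about the layout described in this caption, here is a minimal Python sketch, under illustrative assumptions (grid shape, block size and function names are not DISSECT's), of how a two-dimensional block-cyclic distribution maps matrix blocks onto a grid of compute nodes:

    # Sketch of a 2-D block-cyclic layout: block (bi, bj) of the matrix is stored
    # by process (bi mod p_rows, bj mod p_cols) of the process grid, so consecutive
    # blocks cycle over the nodes and each node holds a roughly equal share.

    def owner_of_block(bi, bj, p_rows, p_cols):
        """Grid coordinates of the process that stores matrix block (bi, bj)."""
        return (bi % p_rows, bj % p_cols)

    def owner_of_element(i, j, block_size, p_rows, p_cols):
        """Map a global matrix element (i, j) to its owning process."""
        return owner_of_block(i // block_size, j // block_size, p_rows, p_cols)

    # Example: with a 4 x 2 process grid and 256 x 256 blocks,
    # element (1000, 3000) falls in block (3, 11) and lives on process (3, 1).
    print(owner_of_element(1000, 3000, block_size=256, p_rows=4, p_cols=2))

Because consecutive blocks cycle over the grid, the work of operations that touch only part of the matrix still spreads across all nodes, which keeps the load balanced.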
Figure 2. Computational requirements for MLM and PCA.
Computational time (blue lines, left axis) and number of processor cores used (red lines, right axis), on a log scale, for (a) MLM and (b) PCA analyses as a function of sample size. Core days is the time in days required to complete an analysis multiplied by the number of cores used. It is a rough estimate of the computational time a single computer with a single core would require to perform the analysis if DISSECT scaled perfectly (that is, with no performance penalty from communication between compute nodes). Labels over the blue dots indicate the real time used for each analysis.
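As a worked example of this definition, using the run reported in the abstract (8,400 cores for ∼4 h):

    \text{core days} \approx 8{,}400 \times \frac{4\ \text{h}}{24\ \text{h/day}} = 1{,}400

so a perfectly scaling single-core machine would need on the order of 1,400 days for the same analysis.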
Figure 3. Prediction accuracy of MLM as a function of sample size and heritability.
Correlation between true (P) and predicted phenotypes as a function of cohort size for a trait determined by 10,000 QTNs. Black, blue and red curves represent heritabilities of 0.2, 0.5 and 0.7, respectively. Horizontal dashed lines indicate the theoretical maximum achievable for each heritability. Error bars are two times the s.d. over six replicates (the 470,000-individual case has only one replicate).
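A note on where these dashed lines sit (a standard quantitative-genetics bound, stated here as an assumption rather than quoted from the paper): if the phenotype is the sum of independent genetic and environmental components, the correlation between the phenotype and even a perfect genetic predictor cannot exceed the square root of the heritability,

    r_{\max} = \sqrt{h^2}, \qquad \sqrt{0.2} \approx 0.45, \quad \sqrt{0.5} \approx 0.71, \quad \sqrt{0.7} \approx 0.84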
Figure 4. Prediction accuracy when all QTNs were genotyped.
Correlation between true (P) and predicted phenotypes as a function of cohort size when the trait is determined by 10,000 QTNs. Black, blue and red curves represent traits with heritabilities of 0.2, 0.5 and 0.7, respectively. Solid lines are the correlations obtained when all QTNs were genotyped; dotted lines are the correlations obtained when only ∼20% of the QTNs were genotyped. Horizontal dashed lines indicate the maximum theoretical correlation for each heritability.

