Computational solutions to large-scale data management and analysis

Eric E Schadt¹, Michael D Linderman, Jon Sorenson, Lawrence Lee, Garry P Nolan

PMID: 20717155
PMCID: PMC3124937
DOI: 10.1038/nrg2857

Review

Eric E Schadt et al. Nat Rev Genet. 2010 Sep;11(9):647-57. doi: 10.1038/nrg2857.

Affiliation

¹ Pacific Biosciences, Menlo Park, California 94025, USA. eschadt@pacificbiosciences.com

Abstract

Today we can generate hundreds of gigabases of DNA and RNA sequencing data in a week for less than US$5,000. The astonishing rate of data generation by these low-cost, high-throughput technologies in genomics is being matched by that of other technologies, such as real-time imaging and mass spectrometry-based flow cytometry. Success in the life sciences will depend on our ability to properly interpret the large-scale, high-dimensional data sets that are generated by these technologies, which in turn requires us to adopt advances in informatics. Here we discuss how we can master the different types of computational environments that exist, such as cloud and heterogeneous computing, to successfully tackle our big data problems.

Figures

**Figure 1. Generating and integrating large-scale, diverse types of data**
Modelling living systems will require generating (a) and integrating (b) multidimensional data sets. In b, large-scale, complex data sets are shown as a network in which the nodes represent variables of biological interest, such as DNA variation, RNA variation, protein levels, protein states, metabolite levels and disease-associated traits, and the edges between these nodes represent causal relationships between the variables. These more granular networks (at the gene level) can be effectively summarized into subnetworks (c) that interact with one another both within and between tissues. In this way, a network-centred view is obtained of how core biological processes interact with one another to define physiological states associated with disease. Part b is adapted, with permission, from REF. © (2009) Macmillan Publishers Ltd. All rights reserved.

**Figure 2. Cluster, cloud, grid and heterogeneous computing hardware and software stacks**
The hardware and software stacks comprise the different layers of a computational environment. At the lowest level of the stack is the physical structure that houses the hardware, with networking infrastructure coming next, and then the physical computers or servers. Sitting on top of the physical hardware is the virtualization layer, and the operating system lies on top of that. Finally, there are the software infrastructure and application layers. The different types of computing can be differentiated by which of these layers are under the user’s direct control (solid line) and which levels are provided by others, for example, the cloud provider and grid volunteer (dashed lines). Cloud and grid services are best suited for applications with loosely coupled, or coarse-grained, parallelism. Heterogeneous systems include specialized hardware accelerators, such as graphics processing units (GPUs). These accelerators are optimized for massive tightly coupled, or fine-grained, parallelism. However, the software that runs on these accelerators differs from its general purpose processor (GPP) counterparts, and often must be specifically written for a particular accelerator. MPI, message passing interface.

**Figure 3. Amazon Web Services**
Amazon Web Services provides a simple and intuitive web-based interface into the Amazon S3 storage services and Amazon EC2 cloud resources. a | The management console available in Amazon Web Services provides a convenient interface into Amazon’s cloud-based services, including direct access to Amazon S3 and Amazon EC2 for data storage and large-scale computing, respectively. b | Steps for using the management console to compute big data using Amazon’s Elastic MapReduce resource (see main text for details).
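
The caption above refers to Elastic MapReduce without describing the programming model. The following is a minimal single-machine sketch of the map, shuffle and reduce steps, assuming a toy k-mer counting task on short reads. It does not call any Amazon API; the function names (`map_kmers`, `shuffle`, `reduce_counts`, `count_kmers`) and the example reads are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_kmers(read, k=3):
    """Map step: emit a (k-mer, 1) pair for every k-length window of one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key (the k-mer)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def count_kmers(reads, k=3):
    """Run map, shuffle and reduce in sequence on an iterable of reads."""
    # Each read is mapped independently of the others, which is the
    # loosely coupled (coarse-grained) parallelism that lets a cluster
    # or cloud service spread the map step across many machines.
    mapped = chain.from_iterable(map_kmers(read, k) for read in reads)
    return reduce_counts(shuffle(mapped))


# Hypothetical toy input, not data from the article.
reads = ["ACGTAC", "GTACGT"]
counts = count_kmers(reads, k=3)
print(counts)
```

In a distributed setting the framework would run the map calls on separate nodes and perform the shuffle over the network; here all three steps run in one process so the data flow is easy to follow.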
