Review
Nat Rev Genet. 2010 Sep;11(9):647-57.
doi: 10.1038/nrg2857.

Computational solutions to large-scale data management and analysis

Eric E Schadt et al. Nat Rev Genet. 2010 Sep.

Abstract

Today we can generate hundreds of gigabases of DNA and RNA sequencing data in a week for less than US$5,000. The astonishing rate of data generation by these low-cost, high-throughput technologies in genomics is being matched by that of other technologies, such as real-time imaging and mass spectrometry-based flow cytometry. Success in the life sciences will depend on our ability to properly interpret the large-scale, high-dimensional data sets that are generated by these technologies, which in turn requires us to adopt advances in informatics. Here we discuss how we can master the different types of computational environments that exist, such as cloud and heterogeneous computing, to successfully tackle our big data problems.

Figures

Figure 1. Generating and integrating large-scale, diverse types of data
Modelling living systems will require generating (a) and integrating (b) multidimensional data sets. In b, large-scale, complex data sets are shown as a network in which the nodes represent variables of biological interest, such as DNA variation, RNA variation, protein levels, protein states, metabolite levels and disease-associated traits, and the edges between these nodes represent causal relationships between the variables. These more granular networks (at the gene level) can be effectively summarized into subnetworks (c) that interact with one another both within and between tissues. In this way, a network-centred view is obtained of how core biological processes interact with one another to define physiological states associated with disease. Part b is adapted, with permission, from REF. © (2009) Macmillan Publishers Ltd. All rights reserved.
Figure 2. Cluster, cloud, grid and heterogeneous computing hardware and software stacks
The hardware and software stacks comprise the different layers of a computational environment. At the lowest level of the stack is the physical structure that houses the hardware, with networking infrastructure coming next, and then the physical computers or servers. Sitting on top of the physical hardware is the virtualization layer, and the operating system lies on top of that. Finally, there are the software infrastructure and application layers. The different types of computing can be differentiated by which of these layers are under the user’s direct control (solid line) and which levels are provided by others, for example, the cloud provider and grid volunteer (dashed lines). Cloud and grid services are best suited for applications with loosely coupled, or coarse-grained, parallelism. Heterogeneous systems include specialized hardware accelerators, such as graphics processing units (GPUs). These accelerators are optimized for massive tightly coupled, or fine-grained, parallelism. However, the software that runs on these accelerators differs from its general purpose processor (GPP) counterparts, and often must be specifically written for a particular accelerator. MPI, message passing interface.
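
As a rough illustration of the loosely coupled, or coarse-grained, parallelism mentioned in the caption, the sketch below runs one independent task per sequence, with no communication between tasks. It is not taken from the article: the function names `gc_fraction` and `map_independent` and the sample reads are hypothetical, and a local thread pool merely stands in for the separate machines that a cloud or grid would provide.

```python
from concurrent.futures import ThreadPoolExecutor


def gc_fraction(seq: str) -> float:
    """Fraction of G/C bases in one sequence; needs no data from other tasks."""
    if not seq:
        return 0.0
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s)


def map_independent(seqs, workers=4):
    """Apply gc_fraction to every sequence; results come back in input order.

    Assumption: a thread pool is used only to keep the sketch self-contained.
    On a cloud or grid, each task could run on a different machine instead.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(gc_fraction, seqs))


if __name__ == "__main__":
    reads = ["ACGT", "GGCC", "ATAT", ""]  # hypothetical toy reads
    print(map_independent(reads))
```

Because each task depends only on its own input, the work can be split across as many workers as are available. Fine-grained workloads, in which tasks exchange data frequently, do not decompose this way and are the case the caption associates with accelerators such as GPUs.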
Figure 3. Amazon Web Services
Amazon Web Services provides a simple and intuitive web-based interface into the Amazon S3 storage services and Amazon EC2 cloud resources. a | The management console available in Amazon Web Services provides a convenient interface into Amazon’s cloud-based services, including direct access to Amazon S3 and Amazon EC2 for data storage and large-scale computing, respectively. b | Steps for using the management console to compute big data using Amazon’s Elastic MapReduce resource (see main text for details).
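
The MapReduce programming model named in the caption can be sketched in a few lines of plain Python. This is an illustration of the model only, under stated assumptions: it does not use or reproduce any Amazon API, and the k-mer counting task and the function names (`map_read`, `shuffle`, `reduce_counts`, `kmer_count`) are hypothetical choices that do not come from the article.

```python
from collections import defaultdict
from itertools import chain


def map_read(read, k=3):
    """Map step: emit a (k-mer, 1) pair for every k-mer in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Shuffle step: group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(groups):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def kmer_count(reads, k=3):
    """Run map, shuffle and reduce in sequence on a list of reads."""
    mapped = chain.from_iterable(map_read(r, k) for r in reads)
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    print(kmer_count(["ACGTA", "CGTAC"], k=3))
```

In a hosted MapReduce service the map and reduce steps would run on many machines and the framework would handle the shuffle. Here all three steps run in a single process so that the data flow is easy to follow.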

References

    1. Eid J, et al. Real-time DNA sequencing from single polymerase molecules. Science. 2009;323:133–138.
    2. Bandura DR, et al. Mass cytometry: technique for real time single cell multitarget immunoassay based on inductively coupled plasma time-of-flight mass spectrometry. Anal. Chem. 2009;81:6813–6822.
    3. Chen Y, et al. Variations in DNA elucidate molecular networks that cause disease. Nature. 2008;452:429–435.
    4. Emilsson V, et al. Genetics of gene expression and its effect on disease. Nature. 2008;452:423–428.
    5. Altshuler D, Daly MJ, Lander ES. Genetic mapping in human disease. Science. 2008;322:881–888.