Review
. 2018 Apr;19(4):208-219.
doi: 10.1038/nrg.2017.113. Epub 2018 Jan 30.

Cloud computing for genomic data analysis and collaboration

Ben Langmead et al. Nat Rev Genet. 2018 Apr.

Erratum in

Abstract

Next-generation sequencing has made major strides in the past decade. Studies based on large sequencing data sets are growing in number, and public archives for raw sequencing data have been doubling in size every 18 months. Leveraging these data requires researchers to use large-scale computational resources. Cloud computing, a model whereby users rent computers and storage from large data centres, is a solution that is gaining traction in genomics research. Here, we describe how cloud computing is used in genomics for research and large-scale collaborations, and argue that its elasticity, reproducibility and privacy features make it ideally suited for the large-scale reanalysis of publicly available archived data, including privacy-protected data.
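The growth rate quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the dates and the count of four doublings are taken from the Figure 1 caption. It computes the doubling time implied by that caption and the growth factor that an 18-month doubling time would produce over a given span.

```python
from datetime import date


def months_between(start: date, end: date) -> int:
    """Whole calendar months from start to end (day of month is ignored)."""
    return (end.year - start.year) * 12 + (end.month - start.month)


def growth_factor(months: float, doubling_months: float) -> float:
    """Multiplicative growth after `months` under steady exponential doubling."""
    return 2 ** (months / doubling_months)


# Figure 1 caption: four doublings between July 2012 and March 2017.
span = months_between(date(2012, 7, 1), date(2017, 3, 1))
implied_doubling = span / 4

# Abstract: doubling every 18 months. Ten years at that rate:
ten_year_factor = growth_factor(120, 18)

print(f"span: {span} months, implied doubling time: {implied_doubling} months")
print(f"growth over 10 years at 18-month doubling: about {ten_year_factor:.0f}x")
```

Over the window in the Figure 1 caption, this gives a doubling time of 14 months, somewhat shorter than the 18-month figure in the abstract. The two numbers may simply describe different time windows, so the sketch should be read as a consistency check rather than a correction.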

Figures

Figure 1:
Four doublings of the Sequence Read Archive (SRA) from July 2012 to March 2017. The large jump in October 2016 is chiefly due to the TOPMed project. As of June 2017, the SRA contains over 12 petabases (millions of billions of bases) of data.
Figure 2:
Elasticity allows the user to rent resources while paying only for what gets used. Panel (a) illustrates a scenario with two computational tasks to perform, colored red and blue. The red task requires 36 computer-hours and can run on up to 8 computers simultaneously. The blue task requires 18 computer-hours and runs on 3 computers simultaneously. On a smaller cluster (left), the two tasks run sequentially and require 15 hours to complete. On a larger cluster (right), representing a cloud cluster, the tasks can run simultaneously and the red task can use its full complement of 8 computers. As a result, both complete within 6 hours. This ignores the fact that many more users are contending for cloud clusters than for an institutional cluster. The greater number of users is mitigated by the fact that needs and timing vary from user to user. Cloud providers also offer incentives, such as spot pricing, to encourage renting at less busy times.
Figure 3:
Each site (blue rectangle) has some computational resources and also generates a portion of the data (red puzzle pieces). In (a), analyses that require the full dataset are performed at multiple sites, requiring each of these sites to gather all portions of the data. As more sites join the analysis, more copies must be made. Panels (b) and (c) show alternative solutions. In (b), sites consolidate their data in a cloud-based data center, where all analyses are performed. In (c), multiple sites organize themselves into a federated cloud, where each analysis of the full dataset is automatically coordinated to minimize data transfer. Where possible, the computers located where data are generated are used to analyze that subset.
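The timings in the Figure 2 caption can be reproduced with a short calculation. This is a minimal sketch under two assumptions that the caption does not state: the smaller cluster has 4 computers (the size that yields the quoted 15-hour total), and each task divides perfectly across the computers it is given. The function and variable names are hypothetical.

```python
def task_hours(computer_hours: float, max_parallel: int, available: int) -> float:
    """Wall-clock hours for a task, assuming it divides perfectly across computers."""
    computers_used = min(max_parallel, available)
    return computer_hours / computers_used


# Values from the Figure 2 caption: (computer-hours, maximum simultaneous computers).
RED = (36, 8)
BLUE = (18, 3)

# Assumption: a 4-computer institutional cluster, inferred from the 15-hour total.
SMALL_CLUSTER = 4

# Smaller cluster: the tasks run one after the other.
small_total = task_hours(*RED, SMALL_CLUSTER) + task_hours(*BLUE, SMALL_CLUSTER)

# Cloud cluster: each task gets its full complement and both run at once,
# so the elapsed time is that of the slower task.
cloud_total = max(task_hours(*RED, RED[1]), task_hours(*BLUE, BLUE[1]))

# Computer-hours consumed are the same in both scenarios.
billed_computer_hours = RED[0] + BLUE[0]

print(f"small cluster: {small_total} h, cloud: {cloud_total} h, "
      f"computer-hours used: {billed_computer_hours}")
```

Under these assumptions, the elapsed time falls from 15 hours to 6 hours while the computer-hours consumed stay at 54 in both cases, which is the sense in which the user pays only for what gets used.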
