J Am Med Inform Assoc. 2014 Nov-Dec;21(6):969-75.
doi: 10.1136/amiajnl-2013-002155. Epub 2014 Jan 24.

Bionimbus: a cloud for managing, analyzing and sharing large genomics datasets

Allison P Heath et al. J Am Med Inform Assoc. 2014 Nov-Dec.

Abstract

Background: As large genomics and phenotypic datasets are becoming more common, it is increasingly difficult for most researchers to access, manage, and analyze them. One possible approach is to provide the research community with several petabyte-scale cloud-based computing platforms containing these data, along with tools and resources to analyze them.

Methods: Bionimbus is an open source cloud-computing platform that is based primarily upon OpenStack, which manages on-demand virtual machines that provide the required computational resources, and GlusterFS, which is a high-performance clustered file system. Bionimbus also includes Tukey, which is a portal, and associated middleware that provides a single entry point and a single sign on for the various Bionimbus resources; and Yates, which automates the installation, configuration, and maintenance of the software infrastructure required.

Results: Bionimbus is used by a variety of projects to process genomics and phenotypic data. For example, it is used by an acute myeloid leukemia resequencing project at the University of Chicago. The project requires several computational pipelines, including pipelines for quality control, alignment, variant calling, and annotation. For each sample, the alignment step requires eight CPUs for about 12 h. BAM file sizes ranged from 5 GB to 10 GB for each sample.
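The per-sample figures above (eight CPUs for about 12 h of alignment, BAM files of 5 GB to 10 GB) can be scaled to a whole cohort with simple arithmetic. The following is a minimal sketch of that calculation; the function names and the cohort size of 100 samples are hypothetical and are not taken from the paper.

```python
def alignment_core_hours(samples: int,
                         cpus_per_sample: int = 8,
                         hours_per_sample: float = 12.0) -> float:
    """Total CPU-hours for the alignment step across a cohort.

    Defaults are the per-sample figures quoted in the abstract.
    """
    return samples * cpus_per_sample * hours_per_sample


def bam_storage_range_gb(samples: int,
                         min_gb: float = 5.0,
                         max_gb: float = 10.0) -> tuple[float, float]:
    """Lower and upper bound on total BAM storage, in GB."""
    return samples * min_gb, samples * max_gb


if __name__ == "__main__":
    cohort = 100  # hypothetical cohort size, for illustration only
    print("CPU-hours:", alignment_core_hours(cohort))
    print("BAM storage (GB):", bam_storage_range_gb(cohort))
```

This only covers the alignment step; quality control, variant calling, and annotation would add to the total and are not quantified in the abstract.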

Conclusions: Most members of the research community have difficulty downloading large genomics datasets and obtaining sufficient storage and computer resources to manage and analyze the data. Cloud computing platforms, such as Bionimbus, with data commons that contain large genomics datasets, are one choice for broadening access to research data in genomics.

Keywords: biomedical clouds; cloud computing; genomic clouds.

Figures

Figure 1
Screenshot of the Tukey console from the Bionimbus Protected Data Cloud. A user can click an image and then click the button labeled ‘Launch’ to start one or more virtual machines. Different images contain different tools, utilities, and pipelines. Users can also create their own custom images containing specific pipelines, tools, and applications of interest.
Figure 2
Major components of the Open Science Data Cloud (OSDC). The OpenFISMA-based application for monitoring, compliance and security is only part of the Bionimbus Protected Data Cloud, and not used by the OSDC in general. FISMA, Federal Information Security Management Act.
Figure 3
Moving large genomics datasets over wide area networks can be difficult. The Open Science Data Cloud (OSDC) supports several protocols for moving datasets, including the open-source UDR protocol, which integrates UDT with rsync and is designed to synchronize large datasets over wide-area, high-performance networks. The figure shows the relative comparison of UDR with rsync when synchronizing the Encyclopedia of DNA Elements (ENCODE) repository between the OSDC in Chicago and the ENCODE Data Coordination Center in Santa Cruz. The transfer speed varies, but UDR consistently has at least four to five times the performance of rsync. ENCODE is open access, so UDR can be used without encryption. UDR with encryption enabled (for moving controlled-access data) achieves about 660 Mb/s when transferring data between the Bionimbus Protected Data Cloud in Chicago and a server at the Ontario Institute for Cancer Research in Toronto, Canada. The speed can be increased using disks with higher throughput or using multiple flows to multiple disks.
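To put the quoted rate of about 660 Mb/s in context, the sketch below estimates how long a dataset of a given size would take to move at a sustained rate. It assumes decimal units (1 TB = 10^12 bytes, 1 Mb = 10^6 bits), a constant rate, and no protocol overhead; the function name and the 1 TB example size are hypothetical.

```python
def transfer_hours(size_tb: float, rate_megabits_per_s: float) -> float:
    """Hours needed to move size_tb terabytes at a sustained rate.

    Assumes decimal units and ignores protocol overhead and rate variation.
    """
    bits = size_tb * 1e12 * 8          # terabytes -> bits
    seconds = bits / (rate_megabits_per_s * 1e6)
    return seconds / 3600.0


if __name__ == "__main__":
    # 660 Mb/s is the encrypted UDR rate quoted in the caption;
    # the 1 TB dataset size is an illustrative assumption.
    print(f"{transfer_hours(1.0, 660.0):.2f} h per TB")
```

Under these assumptions, one terabyte takes a little over three hours, so a petabyte-scale dataset would take months over a single flow at this rate, which is consistent with the caption's remark about using multiple flows to multiple disks.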
Figure 4
This figure compares the charges for Amazon Web Services (AWS) S3 storage with the costs incurred as the Open Science Data Cloud (OSDC) adds 1 PB of storage. The AWS costs were computed using the Simple Monthly Calculator on the AWS web site (/calculator.s3.amazonaws.com/calc5.html) and do not include the cost of accessing the data. We used a simple cost model for the OSDC that includes the capital charges for equipment amortized over 3 years, the operating costs of supplying power, space and cooling, and the operating costs of the staff required to manage the OSDC. This cost model assumes that we are operating a minimum of 20 racks and that we refresh one-third of the racks each year.
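The cost model described in this caption has three parts: equipment capital amortized over 3 years, facility operating costs (power, space, and cooling), and staff costs. The sketch below shows one way such a model could be written down. It is not the authors' actual model: the class name, field names, and every dollar figure are hypothetical placeholders, and the 20-rack minimum and annual one-third refresh are not modeled.

```python
from dataclasses import dataclass


@dataclass
class StorageCostModel:
    """Simple annualized cost model for self-operated storage."""
    capital_cost: float                 # equipment purchase price
    annual_facility_cost: float         # power, space, and cooling per year
    annual_staff_cost: float            # staff needed to operate the system
    amortization_years: float = 3.0     # amortization period from the caption

    def annual_cost(self) -> float:
        """Amortized capital plus yearly operating costs."""
        amortized_capital = self.capital_cost / self.amortization_years
        return (amortized_capital
                + self.annual_facility_cost
                + self.annual_staff_cost)

    def monthly_cost_per_tb(self, capacity_tb: float) -> float:
        """Cost per terabyte per month for the given usable capacity."""
        return self.annual_cost() / 12.0 / capacity_tb


if __name__ == "__main__":
    # All figures below are made-up placeholders, not values from the paper.
    model = StorageCostModel(capital_cost=300_000.0,
                             annual_facility_cost=50_000.0,
                             annual_staff_cost=150_000.0)
    print(model.monthly_cost_per_tb(capacity_tb=1000.0))
```

A per-terabyte monthly figure of this kind is what would be compared against a commercial storage price; as the caption notes for AWS, data access charges would need to be added separately.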
