BMC Bioinformatics. 2011 Aug 30;12:356.
doi: 10.1186/1471-2105-12-356.

CloVR: a virtual machine for automated and portable sequence analysis from the desktop using cloud computing

Samuel V Angiuoli et al. BMC Bioinformatics. 2011.

Abstract

Background: Next-generation sequencing technologies have decentralized sequence acquisition, increasing the demand for new bioinformatics tools that are easy to use, portable across multiple platforms, and scalable for high-throughput applications. Cloud computing platforms provide on-demand access to computing infrastructure over the Internet and can be used in combination with custom-built virtual machines to distribute pre-packaged, pre-configured software.

Results: We describe the Cloud Virtual Resource, CloVR, a new desktop application for push-button automated sequence analysis that can utilize cloud computing resources. CloVR is implemented as a single portable virtual machine (VM) that provides several automated analysis pipelines for microbial genomics, including 16S, whole-genome and metagenome sequence analysis. The CloVR VM runs on a personal computer, utilizes local computer resources and requires minimal installation, addressing key challenges in deploying bioinformatics workflows. In addition, CloVR supports the use of remote cloud computing resources to improve performance for large-scale sequence processing. In a case study, we demonstrate the use of CloVR to automatically process next-generation sequencing data on multiple cloud computing platforms.

Conclusion: The CloVR VM and associated architecture lower the barrier to entry for using complex analysis protocols on both local single- and multi-core computers and cloud systems for high-throughput data processing.

Figures

Figure 1
Schematic of the automated pipelines provided in the CloVR virtual machine. The CloVR virtual machine includes pre-packaged automated pipelines for analyzing raw sequence data on both a local computer and a cloud computing platform. The primary steps in each of the four CloVR protocols are shown (light blue) along with input data (pink) and reference databases (green).
Figure 2
Architecture of the CloVR application. CloVR provides a virtual machine (VM) that is run on the user's local desktop or laptop computer. The user interacts with the local VM via a command line or web interface to execute pipelines. Optionally, clusters of additional VM instances are provisioned on supported cloud platforms for increased throughput. Each cluster has a master VM instance that provides services for GridEngine [36] and Hadoop [37]. Input and output data are transferred between the local VM and a master VM instance in the cloud over the Internet.
Figure 3
Components of the CloVR virtual machine. The CloVR virtual machine (blue) includes pre-installed and pre-configured software dependencies on an Ubuntu operating system to support execution on a local desktop computer and the cloud (yellow). Key software that is bundled with the VM is shown. The asterisk indicates software that was developed as part of the CloVR project.
Figure 4
Steps of an automated pipeline in CloVR. A pipeline executing on the local client VM comprises seven primary steps. The primary API functions invoked during each step are shown with the prefix 'vp-'. For cloud-based execution, a worker pipeline is executed remotely on one or more CloVR VM instances on the cloud. The local client VM monitors the worker pipeline and VM instances on the cloud. Upon pipeline completion, output data is automatically downloaded to the local VM for viewing or post-processing.
Figure 5
Example of a specification file used to configure pipeline execution.
Figure 6
Execution profile of an analysis with CloVR-Microbe. CloVR-Microbe was used to perform whole-genome shotgun (WGS) assembly and annotation on 500,000 3 kbp paired-end sequence reads generated with the 454 Titanium FLX platform from an Escherichia coli whole-genome shotgun library (unpublished data). The local VM client first started a remote (master) VM instance on the cloud. The input sequencing reads (676 MB, compressed SFF file) were copied to this instance and assembled on a single c1.xlarge VM instance, using no more than two of the eight available CPUs. Then, prior to the genome annotation, which involves several parallelizable search steps, 15 additional CloVR VM instances were allocated to improve processing throughput. A configurable parameter limits the number of instances that are added. Idle instances are subsequently terminated automatically upon job completion on an hourly timer.
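The scaling behaviour described in this caption can be illustrated with a minimal Python sketch. It is not CloVR code: the function names, the one-task-per-CPU sizing rule, and the cap of 15 added instances are assumptions chosen only so that the example mirrors the 15 instances added in the figure.

```python
import math


def instances_to_add(pending_tasks: int, cpus_per_instance: int,
                     already_added: int, max_added: int) -> int:
    """Return how many more worker instances to request (hypothetical policy).

    Assumes one task per CPU. max_added stands in for the configurable
    parameter that limits the number of instances that are added.
    """
    wanted = math.ceil(pending_tasks / cpus_per_instance)
    return max(0, min(wanted, max_added) - already_added)


def idle_instances(running_jobs: dict[str, int]) -> list[str]:
    """Return instances with no running jobs, i.e. candidates for
    termination when the (assumed) hourly check fires."""
    return sorted(name for name, jobs in running_jobs.items() if jobs == 0)


if __name__ == "__main__":
    # Many pending search tasks, eight CPUs per instance, cap of 15 additions.
    print(instances_to_add(pending_tasks=1000, cpus_per_instance=8,
                           already_added=0, max_added=15))
    print(idle_instances({"worker-1": 0, "worker-2": 3, "worker-3": 0}))
```

Keeping the cap separate from the sizing rule means a small job requests only the instances it can use, while a large job can never exceed the configured limit.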
Figure 7
Dynamic allocation of CloVR VM instances to a cluster on the cloud running BLAST. A cluster of CloVR VMs is deployed on-the-fly and scaled to 160 c1.xlarge Amazon EC2 instances (totaling 1280 virtualized CPUs) running BLAST of a random sample of ~100 million nucleotides from metagenomic whole-genome shotgun sequencing with 454 Titanium FLX of an unpublished oral microbiome project against the NCBI non-redundant protein database.
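The CPU total in this caption follows from the instance count and the eight CPUs per c1.xlarge instance mentioned in the Figure 6 caption. A short Python check of that arithmetic, using a hypothetical helper name:

```python
def total_virtual_cpus(instances: int, cpus_per_instance: int) -> int:
    """Total virtual CPUs in a cluster of identical instances."""
    return instances * cpus_per_instance


if __name__ == "__main__":
    # 160 c1.xlarge instances, each with eight virtual CPUs.
    print(total_virtual_cpus(160, 8))
```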
Figure 8
Visualization of data transfers between instances over time in a cluster of CloVR VMs. Each segment of the circle represents the lifetime of a single CloVR VM instance. Labels indicate time since bootup in wallclock hours. The red segment represents the master node CloVR VM and the grey segments the worker VM instances. Data transfers between master and worker instances are shown as grey lines. Transfers between worker instances are shown as blue lines.
Figure 9
Network throughput on a cluster of CloVR VMs on Amazon EC2. The aggregate network throughput as measured by Ganglia [59] during a peer-to-peer data transfer on a cluster of 160 c1.xlarge instances on Amazon EC2.
