Comparative Study
2010 May 18;11:259.
doi: 10.1186/1471-2105-11-259.

Cloud computing for comparative genomics

Dennis P Wall et al. BMC Bioinformatics.

Abstract

Background: Large comparative genomics studies and tools are becoming increasingly compute-expensive as the number of available genome sequences continues to rise. The capacity and cost of local computing infrastructures are likely to become prohibitive as this number grows, especially as the breadth of questions being asked also expands. Alternative computing architectures, in particular cloud computing environments, may help alleviate this pressure and enable fast, large-scale, and cost-effective comparative genomics strategies. To test this, we redesigned a typical comparative genomics algorithm, the reciprocal smallest distance algorithm (RSD), to run within Amazon's Elastic Compute Cloud (EC2). We then employed the RSD-cloud for ortholog calculations across a wide selection of fully sequenced genomes.

Results: We ran more than 300,000 RSD-cloud processes within EC2. These jobs were farmed simultaneously to 100 high-capacity compute nodes using Amazon Web Services Elastic MapReduce and included a wide mix of large and small genomes. The computation took just under 70 hours in total and cost $6,302 USD.
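The figures reported above imply rough average costs, which the short calculation below works out. It is a back-of-the-envelope sketch, not a price quoted by the authors: it assumes all 100 nodes ran for the full 70 hours, and it treats "more than 300,000" and "just under 70" as exact values, so the per-process figure is an upper bound and the per-node-hour figure is a slight underestimate.

```python
# Figures taken from the Results paragraph; treated as exact for illustration.
TOTAL_COST_USD = 6302
NODES = 100
HOURS = 70            # reported as "just under 70 hours"
PROCESSES = 300_000   # reported as "more than 300,000"

# Assumption: every node was billed for the whole run.
node_hours = NODES * HOURS
cost_per_node_hour = TOTAL_COST_USD / node_hours
cost_per_process = TOTAL_COST_USD / PROCESSES

print(f"node-hours: {node_hours}")
print(f"average cost per node-hour: ${cost_per_node_hour:.2f}")
print(f"average cost per process:   ${cost_per_process:.3f}")
```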

Conclusions: Porting existing comparative genomics algorithms from local compute infrastructures to the cloud is not trivial. However, the speed and flexibility of cloud computing environments provide a substantial boost at manageable cost. The procedure designed to transform the RSD algorithm into a cloud-ready application is readily adaptable to similar comparative genomics problems.

Figures

Figure 1
The reciprocal smallest distance algorithm (RSD). Arrows denote bidirectional BLAST runs. After each run, hits are paired with the query to calculate evolutionary distances. If the same pair produces the smallest distance in both search directions, it is assumed to be orthologous. The specifics of the algorithm are provided in the Introduction.
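The reciprocal check described in this caption can be sketched in a few lines. This is a minimal illustration only, not the authors' implementation: it assumes the BLAST searches and evolutionary distance estimates have already been done, and that their results are available as two hypothetical dictionaries, `forward` and `reverse`, mapping each query gene to its hits and their distances.

```python
def smallest_distance_hit(distances):
    """Return the hit with the smallest distance, or None if there are no hits."""
    if not distances:
        return None
    return min(distances, key=distances.get)


def reciprocal_smallest_distance(forward, reverse):
    """Pair genes that are each other's smallest-distance hit.

    forward: {gene_in_A: {gene_in_B: distance}} from searching A against B
    reverse: {gene_in_B: {gene_in_A: distance}} from searching B against A
    Both structures are hypothetical stand-ins for precomputed results.
    """
    orthologs = []
    for query, hits in forward.items():
        best = smallest_distance_hit(hits)
        if best is None:
            continue
        # The pair is kept only if the reverse search points back to the query.
        if smallest_distance_hit(reverse.get(best, {})) == query:
            orthologs.append((query, best, hits[best]))
    return orthologs


# Toy data: a3's best hit is b2, but b2's best hit is a2, so a3 is not paired.
forward = {
    "a1": {"b1": 0.12, "b2": 0.40},
    "a2": {"b2": 0.35},
    "a3": {"b2": 0.50},
}
reverse = {
    "b1": {"a1": 0.12},
    "b2": {"a1": 0.40, "a2": 0.35, "a3": 0.50},
}
pairs = reciprocal_smallest_distance(forward, reverse)
print(pairs)
```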
Figure 2
Example of the Compute Cloud user interface for monitoring the health of the cluster and the progress of mapped cloud tasks. (A) Cluster Summary gave an overview of the compute cloud. (B) Running Jobs listed the job ID of the currently running task, the root user, the job name, and the map task progress. (C) Completed Jobs provided an up-to-date summary of completed tasks. This user interface also provided information about failed steps as well as links to individual job logs and histories. Access to this user interface was through FoxyProxy, described in the Methods.
Figure 3
Example of the Job user interface for monitoring the status of individual jobs. (A) Job Summary provided job information such as the user, the job start time, and the job duration. (B) Job Status gave the task completion rate and failure reporting. (C) Job Counter indicated job progress and additional counters. The progression of the mapper was also displayed graphically at the bottom of the web UI page (not shown here). Access to this user interface was through FoxyProxy, described in the Methods.
Figure 4
Workflow for establishment and execution of the reciprocal smallest distance algorithm using the Elastic MapReduce framework on the Amazon Elastic Compute Cloud (EC2). (1) Preconfiguration involves the general setup and porting of the RSD program and genomes to Amazon S3, and configuration of the Mappers for executing the BLAST and RSD runs within the cluster. (2) Instantiation specifies the Amazon EC2 instance type (e.g. small, medium, or large), logging of cloud cluster performance, and preparation of the runner files as described in the Methods. (3) Job Flow Execution launches the processes across the cluster using the command-line arguments indicated in Table 1. This is done for the BLAST and RSD steps separately. (4) The All-vs-All BLAST utilizes the BLAST runner and BLAST mapper to generate a complete set of results for all genomes under consideration. (5) The Ortholog computation step utilizes the RSD runner file and RSD mapper to estimate orthologs and evolutionary distances for all genomes under study. This step utilizes the stored BLAST results from step 4 and can be run asynchronously, at any time after the BLAST processes complete. The Amazon S3 storage bucket was used for persistent storage of BLAST and RSD results. The Hadoop Distributed File System (HDFS) was used for local storage of genomes and genome-specific BLAST results, giving faster I/O when running the RSD step. Additional details are provided in the Methods.
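To make the runner-file idea in steps 2 and 4 concrete, the sketch below generates one command line per ordered genome pair for an all-vs-all run. It is an illustration under stated assumptions, not the authors' file format: the command name `run_blast`, the genome names, and the choice to skip self-comparisons are all hypothetical.

```python
from itertools import permutations


def blast_runner_lines(genomes, command="run_blast"):
    """Build one runner line per ordered (query, subject) genome pair.

    `command` is a hypothetical placeholder, not a real BLAST executable.
    Assumption: self-vs-self comparisons are not needed.
    """
    return [
        f"{command} {query} {subject}"
        for query, subject in permutations(genomes, 2)
    ]


lines = blast_runner_lines(["genomeA", "genomeB", "genomeC"])
for line in lines:
    print(line)
```

With n genomes this produces n × (n − 1) lines, which is why the number of processes grows quickly as genomes are added.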
Figure 5
Example of the mapper program used to run the BLAST and ortholog estimation steps required by the reciprocal smallest distance algorithm (RSD). This example assumes a runner file containing precise command-line arguments for executing the separate steps of the RSD algorithm. The programs were written in Python.
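Since the figure itself is not reproduced here, the following is a minimal sketch of what such a mapper could look like, not the authors' program. It assumes a streaming-style setup in which each input line from the runner file is one complete command, and the names `run_commands` and `main` are hypothetical. The `execute` parameter exists only so the logic can be exercised without launching real processes.

```python
import shlex
import subprocess
import sys


def run_commands(lines, execute=subprocess.call):
    """Run each non-empty line as one command and yield (command, exit_code).

    Assumption: every line of the runner file is a full command line.
    """
    for line in lines:
        command = line.strip()
        if not command:
            continue  # skip blank lines in the runner file
        # shlex.split turns the line into an argument list, so no shell is used.
        exit_code = execute(shlex.split(command))
        yield command, exit_code


def main(stream=sys.stdin, out=sys.stdout):
    """Read commands from the input stream and emit tab-separated results."""
    for command, exit_code in run_commands(stream):
        out.write(f"{command}\t{exit_code}\n")
```

A script used as a mapper would end by calling `main()`, so that commands arrive on standard input and one `command<TAB>exit_code` line is written per command. A nonzero exit code in that output marks a step that would need to be rerun.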
