PLoS One. 2011;6(10):e26624.
doi: 10.1371/journal.pone.0026624. Epub 2011 Oct 19.

Resources and costs for microbial sequence analysis evaluated using virtual machines and cloud computing

Samuel V Angiuoli et al. PLoS One. 2011.

Abstract

Background: The widespread popularity of genomic applications is threatened by the "bioinformatics bottleneck" resulting from uncertainty about the cost and infrastructure needed to meet increasing demands for next-generation sequence analysis. Cloud computing services have been discussed as potential new bioinformatics support systems but have not been evaluated thoroughly.

Results: We present benchmark costs and runtimes for common microbial genomics applications, including 16S rRNA analysis, microbial whole-genome shotgun (WGS) sequence assembly and annotation, WGS metagenomics and large-scale BLAST. Sequence dataset types and sizes were selected to correspond to outputs typically generated by small- to midsize facilities equipped with 454 and Illumina platforms, except for WGS metagenomics where sampling of Illumina data was used. Automated analysis pipelines, as implemented in the CloVR virtual machine, were used in order to guarantee transparency, reproducibility and portability across different operating systems, including the commercial Amazon Elastic Compute Cloud (EC2), which was used to attach real dollar costs to each analysis type. We found considerable differences in computational requirements, runtimes and costs associated with different microbial genomics applications. While all 16S analyses completed on a single-CPU desktop in under three hours, microbial genome and metagenome analyses utilized multi-CPU support of up to 120 CPUs on Amazon EC2, where each analysis completed in under 24 hours for less than $60. Representative datasets were used to estimate maximum data throughput on different cluster sizes and to compare costs between EC2 and comparable local grid servers.

Conclusions: Although bioinformatics requirements for microbial genomics depend on dataset characteristics and the analysis protocols applied, our results suggest that smaller sequencing facilities (up to three Roche/454 or one Illumina GAIIx sequencer) invested in 16S rRNA amplicon sequencing, microbial single-genome and metagenomics WGS projects can achieve cost-efficient bioinformatics support using CloVR in combination with Amazon EC2 as an alternative to local computing centers.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Overview of the CloVR-supported microbial sequence analysis protocols.
1. CloVR-16S supports analysis of pyrotagged amplicon pool sequence data as well as sequence data from individual samples, using components from the Mothur package for preprocessing, alignment, operational taxonomic unit (OTU) assignment and alpha diversity estimation. QIIME components are used for sequence clustering, alignment, phylogenetic inference and beta diversity estimation. Sequence reads are assigned to taxonomies using the RDP classifier. Additional visualizations are generated with R scripts implemented in CloVR. Differentially abundant taxa are determined with Metastats.
2. CloVR-Metagenomics supports functional and taxonomic assignments of non-redundant whole-genome shotgun (WGS) sequence data from metagenomic samples. Reads are classified based on BLASTX and BLASTN searches against functional (COG, optionally KEGG, eggNOG [47]) and taxonomic (RefSeq [25]) reference databases, respectively. The results are statistically evaluated using Metastats and visualized using R scripts implemented in CloVR.
3. CloVR-Microbe supports microbial whole-genome sequencing projects, including assembly of Illumina data with Velvet and of 454 or Sanger data with Celera Assembler (CA). Gene predictions and annotations are performed using the complex IGS standard operating procedure for automated prokaryotic annotation (IGS).
Figure 2. Cost and performance of CloVR-Microbe using different cluster sizes.
A) Steps of the CloVR-Microbe pipeline can be executed in parallel to improve performance as shown by plotting pipeline runtimes (blue) and associated costs (red) against the number of CPUs used to perform the analysis on Amazon EC2. B) Using this data, the theoretical maximum throughput per year (blue) as well as associated costs (red) of analysis can be extrapolated. As an example, the output of a single 454 GS FLX Titanium machine, run every other day with two single microbial genomes per sequencing plate (365 total runs), can be processed on Amazon EC2 using 60 CPUs (or eight Amazon EC2 c1.xlarge instances) for less than $25,000, as indicated by the dashed red and blue lines. Inefficiencies in pipeline implementation resulted in increased competition for resources, longer runtimes, and thus increased costs for clusters containing 2 and 3 instances (16 and 24 CPUs, respectively).
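The arithmetic behind the example in this caption can be sketched as follows. The value of 8 CPUs per c1.xlarge instance is an assumption inferred from the caption itself (2 and 3 instances correspond to 16 and 24 CPUs); the function names are hypothetical, and the per-dataset figure is only an upper bound derived from the "less than $25,000" statement, not a number reported by the authors.

```python
import math

# Assumption inferred from the caption: 2 instances = 16 CPUs, 3 instances = 24 CPUs.
CPUS_PER_INSTANCE = 8


def instances_needed(cpus: int) -> int:
    """Smallest number of whole instances that provides at least `cpus` CPUs."""
    return math.ceil(cpus / CPUS_PER_INSTANCE)


def cost_ceiling_per_run(annual_cost_ceiling: float, runs_per_year: int) -> float:
    """Upper bound on the cost of one run, given an annual cost ceiling."""
    return annual_cost_ceiling / runs_per_year


if __name__ == "__main__":
    # 60 CPUs round up to eight whole instances.
    print(instances_needed(60))
    # 365 runs for less than $25,000 implies a per-run bound in USD.
    print(round(cost_ceiling_per_run(25_000, 365), 2))
```

Rounding up to whole instances is what reconciles "60 CPUs" with "eight Amazon EC2 c1.xlarge instances" under the 8-CPU assumption.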
Figure 3. Costs and throughput of CloVR-16S, CloVR-Metagenomics and CloVR-Microbe analysis runs.
Costs for single CloVR-16S (blue), CloVR-Metagenomics (red) and CloVR-Microbe (black) runs of comparable datasets (∼500 K 454 GS FLX or GS FLX Titanium reads, see Table 1) on Amazon EC2 were extrapolated to calculate the number of runs that are obtainable for a given dollar value. The black dashed line represents the average annual cost ($130 K) to set up and maintain a local cluster of 240 CPUs for three years, taken from Dudley et al. Numbers in boxes show how many runs of CloVR-16S, -Metagenomics, and -Microbe can be afforded for the same cost. As an example, approximately three 454 GS FLX Titanium sequencers (two genomes per sequencing plate and one run per day, adding up to 2,190 datasets) or one Illumina GAIIx sequencer (five genomes per lane, eight lanes per sequencing flow cell and one run per week, adding up to 2,080 datasets) can be processed with CloVR-Microbe on Amazon EC2 annually for the same cost as estimated to set up and maintain the 240 CPU local cluster. The local cluster would, however, provide resources exceeding those required for each of the projected analysis protocols.
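The dataset counts quoted in this caption follow from simple multiplication, which the minimal sketch below reproduces. The function names are hypothetical, and the break-even cost per dataset is a derived illustration (annual local-cluster cost divided by annual datasets), not a value stated in the caption.

```python
LOCAL_CLUSTER_ANNUAL_COST = 130_000  # USD per year, as quoted in the caption


def annual_datasets(machines: int, datasets_per_run: int, runs_per_year: int) -> int:
    """Number of datasets produced per year by a group of sequencers."""
    return machines * datasets_per_run * runs_per_year


def break_even_cost_per_dataset(annual_budget: float, datasets: int) -> float:
    """Per-dataset cloud cost at which the annual spend equals the budget."""
    return annual_budget / datasets


# Three 454 GS FLX Titanium machines, two genomes per plate, one run per day.
roche_454 = annual_datasets(machines=3, datasets_per_run=2, runs_per_year=365)
# One Illumina GAIIx, five genomes per lane x eight lanes, one run per week.
illumina = annual_datasets(machines=1, datasets_per_run=5 * 8, runs_per_year=52)

if __name__ == "__main__":
    print(roche_454, illumina)
    print(round(break_even_cost_per_dataset(LOCAL_CLUSTER_ANNUAL_COST, roche_454), 2))
    print(round(break_even_cost_per_dataset(LOCAL_CLUSTER_ANNUAL_COST, illumina), 2))
```

Under these inputs the break-even for the 454 scenario falls just under $60 per dataset, in line with the abstract's statement that each analysis completed for less than $60.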
Figure 4. Predicted runtimes using varying bid prices for the Amazon EC2 spot market.
An analysis requiring 120 CPU hours was used as an example to estimate the expected completion time for different bid prices for the Amazon EC2 c1.xlarge instance, ranging from $0.27 to $0.80 (on-demand price: $0.68).
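As a rough illustration of what these prices mean for the 120 CPU hour example, the sketch below converts CPU hours to instance hours and multiplies by an hourly price. It assumes 8 CPUs per c1.xlarge instance (not stated in this caption), perfect parallel scaling, and a constant hourly price for the whole job; on the spot market the price actually charged can be lower than the bid, so the bid-based figures are only upper bounds. The function names are hypothetical, and the sketch does not model completion time, which depends on spot price history.

```python
CPUS_PER_INSTANCE = 8  # assumption for the c1.xlarge instance type


def instance_hours(cpu_hours: float, cpus_per_instance: int = CPUS_PER_INSTANCE) -> float:
    """Instance hours needed to deliver `cpu_hours`, assuming perfect scaling."""
    return cpu_hours / cpus_per_instance


def compute_cost(cpu_hours: float, hourly_price: float) -> float:
    """Cost in USD if every instance hour is billed at `hourly_price`."""
    return instance_hours(cpu_hours) * hourly_price


if __name__ == "__main__":
    # Lowest bid, on-demand price, and highest bid from the caption.
    for price in (0.27, 0.68, 0.80):
        print(f"${price:.2f}/h -> ${compute_cost(120, price):.2f}")
```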

References

    1. Bailey RC. Grand challenge commentary: Informative diagnostics for personalized medicine. Nat Chem Biol. 2010;6:857–859.
    2. Green ED, Guyer MS. Charting a course for genomic medicine from base pairs to bedside. Nature. 2011;470:204–213.
    3. Guttmacher AE, McGuire AL, Ponder B, Stefansson K. Personalized genomic information: preparing for the future of genetic medicine. Nat Rev Genet. 2010;11:161–165.
    4. Chin CS, Sorenson J, Harris JB, Robins WP, Charles RC, et al. The origin of the Haitian cholera outbreak strain. N Engl J Med. 2011;364:33–42.
    5. Rusk N. Torrents of sequence. Nature Methods. 2011;8:44.