Gigascience. 2021 Jan 29;10(2):giaa163.
doi: 10.1093/gigascience/giaa163.

Transcriptome annotation in the cloud: complexity, best practices, and cost

Roberto Vera Alvarez et al. Gigascience.

Abstract

Background: The NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) initiative provides NIH-funded researchers cost-effective access to commercial cloud providers, such as Amazon Web Services (AWS) and Google Cloud Platform (GCP). These cloud providers represent an alternative for the execution of large computational biology experiments like transcriptome annotation, which is a complex analytical process that requires the interrogation of multiple biological databases with several advanced computational tools. The core components of annotation pipelines published since 2012 are BLAST sequence alignments against annotated databases of nucleotide or protein sequences, run almost exclusively on networked on-premises compute systems.

Findings: We compare multiple BLAST sequence alignments using AWS and GCP. We prepared several Jupyter Notebooks with all the code required to submit computing jobs to the batch system on each cloud provider. We examine how the number of query transcripts per input file affects cost and processing time. We tested compute instances with 16, 32, and 64 vCPUs on each cloud provider. Four classes of timing results were collected: the total run time, the time for transferring the BLAST databases to the instance local solid-state disk drive, the time to execute the CWL script, and the time for the creation, set-up, and release of an instance. This study aims to establish an estimate of the cost and compute time needed for the execution of multiple BLAST runs in a cloud environment.
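The relationship between the timing classes and cost can be illustrated with a minimal sketch. It assumes that the total run time is the sum of the three component times and that the instance is billed in proportion to wall-clock time; neither assumption is stated in the abstract. The timings and the hourly price below are hypothetical placeholders, not values from the study.

```python
from typing import Dict, Tuple

# Hypothetical timings in seconds for one job (NOT values from the study).
timings: Dict[str, float] = {
    "instance_setup": 120.0,       # creation, set-up, and release of the instance
    "database_transfer": 300.0,    # copying BLAST databases to the local SSD
    "workflow_execution": 1380.0,  # running the CWL script
}

HOURLY_PRICE_USD = 2.0  # hypothetical instance price per hour


def cost_breakdown(
    phase_seconds: Dict[str, float], hourly_price: float
) -> Tuple[float, float, Dict[str, float]]:
    """Return (total seconds, total cost in USD, percent of cost per phase).

    Assumes cost is proportional to wall-clock time on a single instance,
    so each phase's share of the cost equals its share of the total time.
    """
    total_seconds = sum(phase_seconds.values())
    total_cost = total_seconds / 3600.0 * hourly_price
    percent = {
        name: 100.0 * seconds / total_seconds
        for name, seconds in phase_seconds.items()
    }
    return total_seconds, total_cost, percent


total_seconds, total_cost, percent = cost_breakdown(timings, HOURLY_PRICE_USD)
for name, share in percent.items():
    print(f"{name}: {share:.1f}% of cost")
print(f"total: {total_seconds:.0f} s, {total_cost:.2f} USD")
```

Under these placeholder numbers the job runs for 1,800 seconds and costs 1.00 USD; real billing rules (minimum charges, per-second rounding, storage and network fees) would change the result.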

Conclusions: We demonstrate that public cloud providers are a practical alternative for the execution of advanced computational biology experiments at low cost. Using our cloud recipes, the BLAST alignments required to annotate a transcriptome with ∼500,000 transcripts can be processed in <2 hours with a compute cost of ∼$200-$250. In our opinion, for BLAST-based workflows, the choice of cloud platform is not dependent on the workflow but, rather, on the specific details and requirements of the cloud provider. These choices include the accessibility for institutional use, the technical knowledge required for effective use of the platform services, and the availability of open source frameworks such as APIs to deploy the workflow.
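As a rough check on scale, the figures quoted above imply a per-transcript compute cost. The sketch below does only that division; the inputs are the approximate values from the abstract (∼$200-$250 for ∼500,000 transcripts), so the result is an order-of-magnitude estimate rather than a reported measurement.

```python
def cost_per_transcript(total_cost_usd: float, n_transcripts: int) -> float:
    """Average compute cost per transcript, in USD."""
    if n_transcripts <= 0:
        raise ValueError("n_transcripts must be positive")
    return total_cost_usd / n_transcripts


# Approximate bounds taken from the abstract.
low = cost_per_transcript(200.0, 500_000)
high = cost_per_transcript(250.0, 500_000)
print(f"{low:.4f} to {high:.4f} USD per transcript")
```

That is roughly 0.04 to 0.05 cents per transcript for compute alone.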

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Figure 1: Basic components in a cloud-based batch system.
Figure 2: Schema of the transcriptome annotation workflow.
Figure 3: Transcriptome annotation workflow schema [40].
Figure 4: Time and cost for the 10,000 query size files. (a) Total time for each input file for each configuration (cloud provider/machine type/vCPUs). The total cost of processing the 20 input files (200,000 transcripts in total) is at the top of each box using normal and transitory instances. The cost of processing 1 transcript is at the bottom of each box. (b) Time and percent of the total cost for instance creation, set-up, and release. (c) Time and percent of the cost for transferring the BLAST databases to the instance from the cloud storage bucket (S3 in AWS and Cloud Storage in GCP). (d) Time and percent of the cost for the CWL workflow execution. Input files in all plots can be identified by the coloring specified in the top plot legend.
Figure 5: Left (a): Total processing time for 120,000 transcripts using different query sizes. Right (b): Total cost using normal compared to transitory instances.
