Plant Cell. 2016 Apr;28(4):840-54.
doi: 10.1105/tpc.15.00933. Epub 2016 Mar 28.

xGDBvm: A Web GUI-Driven Workflow for Annotating Eukaryotic Genomes in the Cloud

Jon Duvick et al. Plant Cell. 2016 Apr.

Abstract

Genome-wide annotation of gene structure requires the integration of numerous computational steps. Currently, annotation is arguably best accomplished through collaboration of bioinformatics and domain experts, with broad community involvement. However, such a collaborative approach is not scalable at today's pace of sequence generation. To address this problem, we developed the xGDBvm software, which uses an intuitive graphical user interface to access a number of common genome analysis and gene structure tools, preconfigured in a self-contained virtual machine image. Once their virtual machine instance is deployed through iPlant's Atmosphere cloud services, users access the xGDBvm workflow via a unified Web interface to manage inputs, set program parameters, configure links to high-performance computing (HPC) resources, view and manage output, apply analysis and editing tools, or access contextual help. The xGDBvm workflow will mask the genome, compute spliced alignments from transcript and/or protein inputs (locally or on a remote HPC cluster), predict gene structures and gene structure quality, and display output in a public or private genome browser complete with accessory tools. Problematic gene predictions are flagged and can be reannotated using the integrated yrGATE annotation tool. xGDBvm can also be configured to append or replace existing data or load precomputed data. Multiple genomes can be annotated and displayed, and outputs can be archived for sharing or backup. xGDBvm can be adapted to a variety of use cases including de novo genome annotation, reannotation, comparison of different annotations, and training or teaching.
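The order of workflow stages described in the abstract can be sketched roughly as follows. This is only an illustration, not xGDBvm code: the stage names paraphrase the abstract, and `WorkflowRun`, `run_workflow`, and the handler mapping are hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Stage names paraphrase the abstract; they are NOT xGDBvm identifiers.
STAGES: List[str] = [
    "mask_genome",
    "spliced_alignment",
    "predict_gene_structures",
    "score_gene_structure_quality",
    "load_genome_browser",
]


@dataclass
class WorkflowRun:
    """Hypothetical run configuration: only the local/remote choice is modeled."""
    use_remote_hpc: bool = False
    log: List[str] = field(default_factory=list)


def run_workflow(run: WorkflowRun,
                 handlers: Dict[str, Callable[[str], None]]) -> List[str]:
    """Call one handler per stage, in order.

    Per the abstract, only the spliced-alignment stage may run on a
    remote HPC cluster; every other stage is treated as local here.
    """
    for stage in STAGES:
        remote = stage == "spliced_alignment" and run.use_remote_hpc
        where = "remote" if remote else "local"
        handlers[stage](where)
        run.log.append(f"{stage}:{where}")
    return run.log
```

Passing the handlers in as a mapping keeps the sketch independent of any real tool; each handler would stand in for whichever program a given deployment uses at that stage.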

Figures

Figure 1.
Overview of xGDBvm as Implemented at CyVerse (iPlant). xGDBvm is a virtual server environment for gene structure annotation that can be cloned, configured, populated with input data, and run from a Web browser in a few steps, as summarized here. (A) Log in to the CyVerse Atmosphere Control Panel (https://atmo.iplantcollaborative.org/application) (1) and click to create a new instance (cloned copy) of xGDBvm (2), create a block storage volume for output data, and attach it to the instance (3). Open a Web shell interface (4), accessible from the Control Panel, and type a series of commands to set up and configure the new xGDBvm instance, also mounting the Data Store and the attached volume. (B) Log in to the CyVerse Data Store cloud storage system (https://de.iplantcollaborative.org/de/) and upload input data files to an input data directory (accessible to the VM) using a batch uploading tool. Naming conventions are used to identify each input type. (C) Log in to the xGDBvm instance’s GUI using HTTPS via its unique IP address or using a VNC (1). All subsequent steps are performed using the xGDBvm GUI. Authorize the VM to connect to remote HPC resources via the Agave API (http://agaveapi.co) (2). Configure the path to Data Store inputs and set other parameters including remote job execution (optional). xGDBvm will validate files, return expected outputs, and flag any input file errors (3). Initiate automated workflows and monitor progress (4). The workflow sends some data remotely for processing on HPC resources (https://www.xsede.org/) managed by Agave APIs and processes other files locally using the attached volume as a scratch disk. The xGDBvm workflow waits for HPC outputs and then proceeds with the annotation process. Output data are written to the external volume and can be accessed from the xGDBvm Web browser as GDB001, GDB002, etc. (5). In addition to a fully featured genome browser, xGDBvm includes tools to query, update, reannotate, download, or archive outputs to the user’s Data Store. For details, refer to the xGDBvm wiki (http://goblinx.soic.indiana.edu/wiki/doku.php).
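The caption notes that the workflow waits for HPC outputs before continuing. A generic polling loop along those lines might look like the sketch below. It makes no claim about how xGDBvm or the Agave API implement this: `wait_for_remote_job`, the `get_status` callback, and the status strings are all assumptions made for illustration.

```python
import time
from typing import Callable

# Hypothetical status strings; real job services define their own vocabulary.
DONE = "FINISHED"
FAILED_STATES = ("FAILED", "STOPPED")


def wait_for_remote_job(get_status: Callable[[], str],
                        poll_seconds: float = 30.0,
                        timeout_seconds: float = 3600.0,
                        sleep: Callable[[float], None] = time.sleep,
                        clock: Callable[[], float] = time.monotonic) -> str:
    """Block until the remote job finishes, fails, or the timeout passes.

    `get_status` is a caller-supplied function that returns the job's
    current status string. `sleep` and `clock` are injectable so the
    loop can be exercised without real waiting.
    """
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status == DONE:
            return status
        if status in FAILED_STATES:
            raise RuntimeError(f"remote job ended with status {status}")
        if clock() >= deadline:
            raise TimeoutError("remote job did not finish in time")
        sleep(poll_seconds)
```

Only once this call returns would the local steps that depend on the remote outputs be started.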
Figure 2.
Data Process Schema. Input data types (with standardized names as indicated), computational modules, and outputs are shown. Images are screenshots of color-coded track glyph types (gene models; splice alignments) and track flags (quality scores) displayed in the xGDBvm genome browser.
Figure 3.
xGDBvm Architecture. An xGDBvm VM instance, as hosted on the CyVerse Atmosphere cloud infrastructure (https://atmo.iplantcollaborative.org/application), has separate file system partitions under root (containing the xGDBvm Web GUI, scripts, binaries, and other software) and /home/ (which is configured with mount points for the user’s Data Store home directory for data input and a block storage volume for data output). The Agave API, hosted by the CyVerse Discovery Environment, is used for authentication of the VM via OAuth2 and for management of HPC applications and job submission. A key feature of xGDBvm is the ability to attach and mount the output volume to a different VM and reconstitute the annotation outputs and display. See text for details.
Figure 4.
xGDBvm Data Management. (A) Screenshot of the GDB Configuration page, set up for processing Example data. Each genome annotation is assigned a unique identifier (GDB001, GDB002, etc.) and a user-provided name. In addition to form fields for input data path, annotation parameters, and metadata, this page provides extensive color-coded information about all system settings (e.g., license keys, storage capacity, and login status, displayed in blue-green), input data validity (light green), and expected output (orange). The form includes buttons that launch modal windows to initiate computational workflow or edit configuration. (B) Screenshot of Archive/Delete menu, showing genome databases with “Current” (blue; computation complete) or “Development” (gray; not yet run) status. Genome annotations are identified as GDB001, GDB002, etc. Each table row displays information about a GDB including time stamps as well as action buttons that allow the user to drop, delete, archive, delete archive, or copy database (see text for details). Global action buttons (top right) allow the user to delete or archive all data on the VM. (C) Screenshot of “List All Jobs” page with tools to monitor and manage remote HPC jobs. The page displays IDs, job metadata, time stamps, color-coded status indicators, and action buttons to manage output (Stop Job, Delete Job, View Logs, Copy Output) via the Agave API. See text for details.
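The identifier and status scheme in this caption (sequential IDs such as GDB001 and GDB002, each with a "Development" or "Current" status) can be modeled with a small registry. The sketch below is a hypothetical illustration; `GenomeDB`, `Registry`, and their methods are invented here and are not part of xGDBvm.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class Status(Enum):
    DEVELOPMENT = "Development"  # not yet run
    CURRENT = "Current"          # computation complete


@dataclass
class GenomeDB:
    gdb_id: str   # e.g. "GDB001"
    name: str     # user-provided name
    status: Status = Status.DEVELOPMENT


class Registry:
    """Hypothetical in-memory registry of genome annotation databases."""

    def __init__(self) -> None:
        self._dbs: Dict[str, GenomeDB] = {}
        self._next_number = 1  # never reused, so deleted IDs do not collide

    def create(self, name: str) -> GenomeDB:
        """Assign the next sequential ID (GDB001, GDB002, ...) to a new entry."""
        gdb_id = f"GDB{self._next_number:03d}"
        self._next_number += 1
        db = GenomeDB(gdb_id=gdb_id, name=name)
        self._dbs[gdb_id] = db
        return db

    def mark_complete(self, gdb_id: str) -> None:
        """Move an entry from Development to Current once computation is done."""
        self._dbs[gdb_id].status = Status.CURRENT

    def delete(self, gdb_id: str) -> None:
        """Remove an entry without freeing its ID for reuse."""
        del self._dbs[gdb_id]
```

Whether xGDBvm itself reuses identifiers after deletion is not stated in the caption; keeping a separate counter is simply a design choice made for this sketch.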
Figure 5.
Genome Context View. Shown is a typical region from the C. rubella genome annotation described in Results. Genome span is shown in yellow, and genome features (tracks) are as labeled to the left and above each track. Drag-and-drop reorder and “hide track” features are implemented here. Top bar provides search and navigation controls; left bar contains links to tools and views, as well as to configuration and help pages. Region submenu (orange) contains zoom/scroll, region-specific tools, and formatting controls. See Table 1 for details of xGDBvm tools and features.
Figure 6.
Gene Model Improvement Using yrGATE. (A) A published gene model from C. rubella (Carubv1011418m.g) showing high coverage/low integrity in the Locus Table (upper table, highlighted columns). (B) Corresponding gene model in genome context view (blue glyph). CpGAT annotated this region as two distinct loci (magenta glyph), backed up by both Arabidopsis protein (black) and cDNA (light blue). The region was then reannotated using yrGATE (dark and light green glyphs) to confirm the most probable gene structure of this region based on available evidence. yrGATE glyphs are color-coded according to the type assigned by the annotator, e.g., dark green (improved structure) and light green (new structure not previously annotated).

