Comparative Study
BMC Genomics. 2017 Apr 27;18(1):332.
doi: 10.1186/s12864-017-3717-3.

CloVR-Comparative: automated, cloud-enabled comparative microbial genome sequence analysis pipeline

Sonia Agrawal et al. BMC Genomics.

Abstract

Background: The benefit of increasing genomic sequence data to the scientific community depends on easy-to-use, scalable bioinformatics support. CloVR-Comparative combines commonly used bioinformatics tools into an intuitive, automated, and cloud-enabled analysis pipeline for comparative microbial genomics.

Results: CloVR-Comparative runs on annotated complete or draft genome sequences that are uploaded by the user or selected via a taxonomic tree-based user interface and downloaded from NCBI. CloVR-Comparative runs reference-free multiple whole-genome alignments to determine unique, shared and core coding sequences (CDSs) and single nucleotide polymorphisms (SNPs). Output includes short summary reports and detailed text-based results files, graphical visualizations (phylogenetic trees, circular figures), and a database file linked to the Sybil comparative genome browser. Data upload and download, pipeline configuration and monitoring, and access to Sybil are managed through the CloVR-Comparative web interface. CloVR-Comparative and Sybil are distributed as part of the CloVR virtual appliance, which runs on local computers or the Amazon EC2 cloud. Representative datasets (e.g. 40 draft and complete Escherichia coli genomes) are processed in <36 h on a local desktop or at a cost of <$20 on EC2.
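As an illustration of the unique, shared and core CDS categories, the following minimal Python sketch labels ortholog clusters by the genomes in which they occur. It is not the pipeline's own code: the function name `classify_clusters`, the input layout (cluster ID mapped to a set of genome names) and the genome names are assumptions made for this example, and "shared" is taken here to mean present in more than one but not all input genomes.

```python
from typing import Dict, Set


def classify_clusters(clusters: Dict[str, Set[str]], genomes: Set[str]) -> Dict[str, str]:
    """Label each cluster as 'core', 'shared' or 'unique' (hypothetical helper).

    clusters maps a cluster ID to the set of genomes that contain a member CDS.
    genomes is the full set of input genomes.
    """
    labels = {}
    for cluster_id, members in clusters.items():
        present = members & genomes  # ignore genomes outside the input set
        if not present:
            continue  # cluster not found in any input genome
        if present == genomes:
            labels[cluster_id] = "core"    # found in every input genome
        elif len(present) == 1:
            labels[cluster_id] = "unique"  # found in exactly one genome
        else:
            labels[cluster_id] = "shared"  # found in some, but not all
    return labels


# Made-up example data: three genomes and three clusters.
example_genomes = {"genome_A", "genome_B", "genome_C"}
example_clusters = {
    "cluster_1": {"genome_A", "genome_B", "genome_C"},
    "cluster_2": {"genome_A"},
    "cluster_3": {"genome_A", "genome_B"},
}
example_labels = classify_clusters(example_clusters, example_genomes)
```

With a single input genome, every cluster would be labeled core because the core check runs first; a real implementation might treat that case differently.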

Conclusions: CloVR-Comparative allows anybody with Internet access to run comparative genomics projects, while eliminating the need for on-site computational resources and expertise.

Keywords: Automated analysis; Bioinformatics resource; Cloud computing; Comparative genomics; Microbial genomics; Virtual machine; Whole-genome alignment.

Figures

Fig. 1
CloVR-Comparative configuration screen. Three options are available to the user to identify and select annotated genome sequence data as input for CloVR-Comparative: (a) using uploaded GenBank files or GenBank files generated by the CloVR-Microbe protocol, both of which can be specified by so-called “tags” as described in the CloVR documentation [1]; (b) through drag-and-drop in the searchable interactive interface that lists genomes available from RefSeq in a taxonomic tree format; and (c) by specifying a list of comma-separated GenBank accession numbers
Fig. 2
Overview and flowchart of CloVR-Comparative. Input data in the form of annotated genomes in GenBank format is first validated and converted into other file formats, then used in whole-genome alignment (WGA) with Mugsy and alignment of translated CDS with MUSCLE to determine COGs. WGAs are used to identify SNPs and to predict phylogenetic relationships based on core genomic regions with Phylomark. From the results, individual circular plots are generated for each input genome. The analysis output comprises summary and detailed results reports, tree and circular figures, and a Sybil database that supports searches of the comparative genome data in a web browser
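The SNP identification step can be illustrated with a small Python sketch that scans an aligned core region for "unique SNPs" in the sense used in the Fig. 3 caption: positions where the reference nucleotide differs from the nucleotide in every other input genome. This is a simplified sketch, not the pipeline's implementation. The function name `unique_snp_positions`, the dictionary input format and the choice to skip columns containing gaps are all assumptions.

```python
from typing import Dict, List


def unique_snp_positions(alignment: Dict[str, str], reference: str) -> List[int]:
    """Return 0-based alignment columns where the reference base differs
    from the base in every other genome (hypothetical helper).

    alignment maps genome name to its aligned sequence; all sequences must
    have the same length. Columns with a gap ('-') are skipped (assumption).
    """
    ref_seq = alignment[reference].upper()
    others = [seq.upper() for name, seq in alignment.items() if name != reference]
    if not others:
        return []  # nothing to compare against
    if any(len(seq) != len(ref_seq) for seq in others):
        raise ValueError("aligned sequences must have equal length")

    positions = []
    for i, ref_base in enumerate(ref_seq):
        column = [seq[i] for seq in others]
        if ref_base == "-" or "-" in column:
            continue  # gapped column: not treated as a SNP here
        if all(base != ref_base for base in column):
            positions.append(i)
    return positions


# Made-up example: only column 3 differs between the reference and all others.
example_alignment = {
    "reference": "ACGTA",
    "genome_2": "ACGAA",
    "genome_3": "CCGCA",
}
example_positions = unique_snp_positions(example_alignment, "reference")
```

Column 0 in the example is polymorphic but not unique to the reference, because genome_2 carries the same base as the reference there.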
Fig. 3
Example of a circular figure output. The figure was generated with Circleator using the example test dataset from the project website as input. It uses Neisseria meningitidis alpha14 (GenBank accession number: NC_013016) as a reference and depicts from outside to inside (1) complete genome (contigs of draft assemblies would be sorted by size); (2, 3) CDSs on forward and reverse strands; (4) core CDSs, defined as COGs that are shared between all input genomes; (5) unique CDSs that are only present in the reference genome (i.e. N. meningitidis alpha14 in this case); (6) unique SNPs, defined as being part of the core genome shared between all input genomes but containing a nucleotide in the reference genome that is different from all other input genomes; (7) G + C content in percent with maximum value shown as gray dotted line, calculated using non-overlapping windows of 5 kbp length; and (8) GC skew, with maximum value shown as gray dotted line, calculated as (G - C) / (G + C) where G and C are nucleotide counts over non-overlapping windows of 5 kbp length
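The two windowed statistics in tracks (7) and (8) can be sketched in Python as follows. The GC skew formula (G - C) / (G + C) and the 5 kbp non-overlapping window come from the caption; everything else is an assumption for illustration, including the function name `gc_windows`, dividing G + C by the window length to obtain the percentage, keeping a shorter final window, and reporting a skew of 0.0 for windows without G or C. Circleator's actual handling of these cases may differ.

```python
from typing import List, Tuple


def gc_windows(sequence: str, window: int = 5000) -> List[Tuple[int, float, float]]:
    """Compute G + C percent and GC skew over non-overlapping windows.

    Returns a list of (window_start, gc_percent, gc_skew) tuples, where
    gc_skew = (G - C) / (G + C) with G and C counted inside the window.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    seq = sequence.upper()
    results = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]  # last chunk may be shorter
        g = chunk.count("G")
        c = chunk.count("C")
        gc_percent = 100.0 * (g + c) / len(chunk)
        # Avoid division by zero when the window has no G or C (assumption).
        gc_skew = (g - c) / (g + c) if (g + c) else 0.0
        results.append((start, gc_percent, gc_skew))
    return results


# Tiny made-up sequence with a 4 bp window so the arithmetic is easy to follow.
example_windows = gc_windows("GGGCAAAA", window=4)
```

In the example, the first window "GGGC" has G = 3 and C = 1, giving 100% G + C and a skew of (3 - 1) / (3 + 1) = 0.5; the second window "AAAA" has no G or C.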
Fig. 4
Whole-genome alignment-based phylogenetic tree of 40 E. coli genomes from different phylogroups. Reference genomes from eight E. coli strains (see [9] for GenBank accession numbers) were used as input for CloVR-Comparative. Colored boxes and phylogroup assignments were manually added to the automatically generated tree in Newick format
Fig. 5
Screenshot of the phsABCD gene cluster comparison between different S. enterica serotypes. The screenshot from the Sybil comparative analysis tool highlights the phs operon that encodes the enzymes for the anaerobic production of hydrogen sulfide from thiosulfate, which are used in anaerobic respiration. The comparison shows that, of the four genes that are present in S. Typhimurium LT2, two (phsA and phsD) are missing from the two S. Paratyphi A strains AKU 12601 and ATCC 9150 and one (phsD) is missing from the two S. Typhi strains CT18 and Ty2. The corresponding genomic regions were manually checked and confirmed to contain interrupted open reading frames in those genomes without gene calls. Gene designations in red were manually added to the screenshot, which was copied directly from the Sybil browser
Fig. 6
Screenshot of the torSTRCAD gene cluster comparison between different S. enterica serotypes. The Sybil screenshot highlights the tor gene cluster that is responsible for the reduction of trimethylamine oxide (TMAO) to trimethylamine, which is used in anaerobic respiration. The comparison shows that, of the six genes that are present in S. Typhimurium LT2, at least one and in several cases two are missing from S. Choleraesuis SC B67 (torT), S. Gallinarum 287/91 (torS), S. Paratyphi A RKS4594 (torTR), S. Typhi CT18 (torRC) and Ty2 (torR). The corresponding genomic regions were manually checked and confirmed to contain interrupted open reading frames in those genomes without gene calls. Gene designations in red were manually added to the screenshot, which was copied directly from the Sybil browser

References

    1. Angiuoli SV, Matalka M, Gussman A, Galens K, Vangala M, Riley DR, Arze C, White JR, White O, Fricke WF. CloVR: A virtual machine for automated and portable sequence analysis from the desktop using cloud computing. BMC Bioinformatics. 2011;12:356. doi: 10.1186/1471-2105-12-356.
    2. Angiuoli SV, White JR, Matalka M, White O, Fricke WF. Resources and costs for microbial sequence analysis evaluated using virtual machines and cloud computing. PLoS One. 2011;6(10). doi: 10.1371/journal.pone.0026624.
    3. Galens K, White JR, Arze C, Matalka M, Giglio MG, Team TC, Angiuoli SV, Fricke WF. CloVR-Microbe: Assembly, gene finding and functional annotation of raw sequence data from single microbial genome projects – standard operating procedure, version 1.0. Nature Precedings. 2011.
    4. Angiuoli SV, Salzberg SL. Mugsy: fast multiple alignment of closely related whole genomes. Bioinformatics. 2011;27(3):334–342. doi: 10.1093/bioinformatics/btq665.
    5. Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL. Versatile and open software for comparing large genomes. Genome Biol. 2004;5(2):R12. doi: 10.1186/gb-2004-5-2-r12.