Nucleic Acids Res. 2021 Jan 8;49(D1):D817-D824.
doi: 10.1093/nar/gkaa846.

ViruSurf: an integrated database to investigate viral sequences

Arif Canakoglu et al. Nucleic Acids Res. 2021.

Abstract

ViruSurf, available at http://gmql.eu/virusurf/, is a large public database of viral sequences and integrated and curated metadata from heterogeneous sources (RefSeq, GenBank, COG-UK and NMDC); it also exposes computed nucleotide and amino acid variants, called from original sequences. A GISAID-specific ViruSurf database, available at http://gmql.eu/virusurf_gisaid/, offers a subset of these functionalities. Given the current pandemic outbreak, SARS-CoV-2 data are collected from all four sources; ViruSurf also contains other virus species harmful to humans, including SARS-CoV, MERS-CoV, Ebola and Dengue. The database is centered on sequences, described along their biological, technological and organizational dimensions. In addition, the analytical dimension characterizes each sequence in terms of its annotations and variants. The web interface enables expressing complex search queries in a simple way; arbitrary search queries can freely combine conditions on attributes from the four dimensions, extracting the resulting sequences. Several example queries on the database confirm and possibly improve results from recent research papers; results can be recomputed over time and upon selected populations. Effective search over large and curated sequence data may enable faster responses to future threats that could arise from new viruses.


Figures

Figure 1.
Logical schema of the relational database in the back-end of ViruSurf.
Figure 2.
General pipeline of the ViruSurf platform. For given sources and species, we use download procedures to construct content, perform data curation, and load the content into two distinct databases, for GISAID and for the other sources, which are schema-compatible (the former is a subset of the latter). We then provide two Web-based interfaces supporting search and result inspection.
Figure 3.
Counts of SARS-CoV-2 overlapping sequences from each source. Overlaps are computed by means of either the strain name, or both strain name and length.
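
The two matching rules named in the Figure 3 caption (strain name only, or strain name plus length) can be illustrated with a minimal Python sketch. This is not the authors' procedure: the record type `Seq`, its field names, the function `overlap_count` and the toy strain names are all assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical record type; the field names are assumptions, not ViruSurf's schema.
Seq = namedtuple("Seq", ["source", "strain", "length"])

def overlap_count(a, b, use_length=False):
    """Count sequences in `a` that have a match in `b`.

    A match is the same strain name, or the same strain name and
    the same length when `use_length` is True.
    """
    def key(s):
        return (s.strain, s.length) if use_length else (s.strain,)

    keys_b = {key(s) for s in b}
    return sum(1 for s in a if key(s) in keys_b)

# Toy data, invented for this example.
source_a = [Seq("A", "strain-1", 29903), Seq("A", "strain-2", 29870)]
source_b = [Seq("B", "strain-1", 29903), Seq("B", "strain-2", 29782)]

by_name = overlap_count(source_a, source_b)
by_name_and_length = overlap_count(source_a, source_b, use_length=True)
print(by_name, by_name_and_length)
```

Matching on strain name plus length is the stricter rule, so its count can never exceed the count obtained from strain name alone.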
Figure 4.
Overview of the ViruSurf interface. Part 1 (Top bar) lets the user reset the previously chosen query or select predefined example queries. Queries are composed by using Part 2 (Metadata search) and Part 3 (Variants search). In our example, Part 2 includes three filters on Virus taxon name, Is complete and N%. Part 3 includes three panels. Panel ‘A’ is a query on amino acid variants, selecting sequences with RK and GR changes in gene N; Panel ‘B’ is a query on nucleotide variants, selecting sequences with a variant at position 28881. Panels ‘A’ and ‘B’ are closed; they can be removed but not changed. Panel ‘C’ is another query on amino acid variants, currently open; it includes two filters selecting given positions of the Spike protein, and visualizes available values for the original amino acid involved in the change. Part 4 shows the Result Visualization. Resulting sequences already reflect the filters of Part 2 and the queries of the closed panels ‘A’ and ‘B’ of Part 3, applied in conjunction. Results can be downloaded in CSV or FASTA format; they can be selected as either cases (default) or controls (switch), and both the nucleotide and amino acid sequences can be projected on a specific protein; table columns can be omitted and reordered. In the bottom right corner, the number of sequences resulting from the search is shown (the figure shows only three of the 14 sequences found).
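
The conjunctive semantics described in the Figure 4 caption (metadata filters and closed variant panels applied together) can be sketched in Python as follows. This is an illustrative model only, not the ViruSurf implementation or its API: the `Sequence` class, its fields, the `search` function, the N% threshold and the toy records are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Sequence:
    # Hypothetical in-memory record; field names are assumptions.
    accession: str
    taxon_name: str
    is_complete: bool
    n_percent: float
    aa_variants: set = field(default_factory=set)   # (gene, original, alternative)
    nuc_variants: set = field(default_factory=set)  # variant positions

def search(sequences, metadata_filters, variant_panels):
    """Keep the sequences that satisfy every filter and every panel."""
    predicates = list(metadata_filters) + list(variant_panels)
    return [s for s in sequences if all(p(s) for p in predicates)]

# Toy records, invented for this example.
seqs = [
    Sequence("toy-1", "SARS-CoV-2", True, 0.2,
             {("N", "R", "K"), ("N", "G", "R")}, {28881}),
    Sequence("toy-2", "SARS-CoV-2", True, 0.1, {("N", "R", "K")}, {28881}),
    Sequence("toy-3", "SARS-CoV-2", False, 0.0,
             {("N", "R", "K"), ("N", "G", "R")}, {28881}),
]

# Part 2: metadata filters (the 1.0 threshold on N% is an assumed value).
metadata_filters = [
    lambda s: s.taxon_name == "SARS-CoV-2",
    lambda s: s.is_complete,
    lambda s: s.n_percent <= 1.0,
]

# Part 3: panel 'A' (RK and GR changes in gene N) and panel 'B'
# (a nucleotide variant at position 28881).
variant_panels = [
    lambda s: {("N", "R", "K"), ("N", "G", "R")} <= s.aa_variants,
    lambda s: 28881 in s.nuc_variants,
]

hits = search(seqs, metadata_filters, variant_panels)
print([s.accession for s in hits])
```

Modelling each filter and panel as a predicate keeps the conjunction explicit: removing a panel from the list widens the result set and adding one narrows it.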

