Nucleic Acids Res. 2021 Jan 8;49(D1):D817-D824.

doi: 10.1093/nar/gkaa846.

ViruSurf: an integrated database to investigate viral sequences

Arif Canakoglu¹, Pietro Pinoli¹, Anna Bernasconi¹, Tommaso Alfonsi¹, Damianos P Melidis², Stefano Ceri¹

Affiliations

¹ Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Via Ponzio 34/5, 20133 Milano, Italy.
² L3S Research Center, Leibniz University Hannover, Appelstr. 9a, 30167 Hannover, Germany.

PMID: 33045721
PMCID: PMC7778888
DOI: 10.1093/nar/gkaa846

Arif Canakoglu et al. Nucleic Acids Res. 2021.

Abstract

ViruSurf, available at http://gmql.eu/virusurf/, is a large public database of viral sequences with integrated and curated metadata from heterogeneous sources (RefSeq, GenBank, COG-UK and NMDC); it also exposes computed nucleotide and amino acid variants, called from original sequences. A GISAID-specific ViruSurf database, available at http://gmql.eu/virusurf_gisaid/, offers a subset of these functionalities. Given the current pandemic, SARS-CoV-2 data are collected from the four sources, but ViruSurf also contains other virus species harmful to humans, including SARS-CoV, MERS-CoV, Ebola and Dengue. The database is centered on sequences, described along their biological, technological and organizational dimensions. In addition, the analytical dimension characterizes each sequence in terms of its annotations and variants. The web interface makes it simple to express complex search queries; arbitrary search queries can freely combine conditions on attributes from the four dimensions, extracting the resulting sequences. Several example queries on the database confirm and possibly improve results from recent research papers; results can be recomputed over time and on selected populations. Effective search over large and curated sequence data may enable faster responses to future threats that could arise from new viruses.
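As a rough illustration of the four-dimension organization described in the abstract, the following Python sketch models a sequence record and combines conditions drawn from different dimensions. The class, field names and sample records are hypothetical and invented for this example; they do not reflect the actual ViruSurf schema or any ViruSurf API.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SequenceRecord:
    # Hypothetical fields, one per dimension named in the abstract.
    accession: str
    taxon_name: str            # biological dimension
    sequencing_platform: str   # technological dimension
    source_database: str       # organizational dimension
    aa_variants: List[str] = field(default_factory=list)  # analytical dimension


# A condition is any predicate over a record.
Condition = Callable[[SequenceRecord], bool]


def search(records: List[SequenceRecord],
           conditions: List[Condition]) -> List[SequenceRecord]:
    """Return the records that satisfy every condition (logical AND)."""
    return [r for r in records if all(c(r) for c in conditions)]


# Invented sample records, for illustration only.
records = [
    SequenceRecord("SEQ-1", "SARS-CoV-2", "Illumina", "GenBank", ["S:D614G"]),
    SequenceRecord("SEQ-2", "SARS-CoV-2", "Nanopore", "COG-UK"),
    SequenceRecord("SEQ-3", "MERS-CoV", "Illumina", "RefSeq"),
]

# Conditions on three different dimensions, freely combined.
hits = search(records, [
    lambda r: r.taxon_name == "SARS-CoV-2",
    lambda r: r.source_database == "GenBank",
    lambda r: "S:D614G" in r.aa_variants,
])
print([r.accession for r in hits])
```

Representing each condition as a plain predicate keeps the sketch independent of any particular storage layer; in a relational back-end the same conjunction would presumably be expressed as a `WHERE` clause over joined tables.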

Figures

**Figure 1.**
Logical schema of the relational database in the back-end of ViruSurf.

**Figure 2.**
General pipeline of the ViruSurf platform. For given sources and species, we use download procedures to construct content, perform data curation, and load the content into two distinct databases, for GISAID and for the other sources, which are schema-compatible (the former is a subset of the latter). We then provide two Web-based interfaces supporting search and result inspection.

**Figure 3.**
Counts of SARS-CoV-2 overlapping sequences from each source. Overlaps are computed by means of either the strain name, or both strain name and length.
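The two matching criteria in this caption (strain name alone, or strain name together with length) can be sketched in Python as a set intersection over the chosen key. This is a minimal sketch under stated assumptions: the source names, strain names and lengths below are invented, and the function counts distinct matching keys, which may differ from how the authors computed the figure.

```python
from typing import Dict, Hashable, List, Set, Tuple

# Invented records per source: (strain name, sequence length).
Record = Tuple[str, int]
sources: Dict[str, List[Record]] = {
    "source_a": [("strain/001", 29903), ("strain/002", 29850)],
    "source_b": [("strain/001", 29903), ("strain/002", 29782),
                 ("strain/003", 29900)],
}


def overlap_keys(a: List[Record], b: List[Record],
                 use_length: bool) -> Set[Hashable]:
    """Keys present in both sources.

    With use_length=False the key is the strain name only;
    with use_length=True the key is the (strain name, length) pair.
    """
    def key(rec: Record) -> Hashable:
        return rec if use_length else rec[0]

    return {key(r) for r in a} & {key(r) for r in b}


by_name = overlap_keys(sources["source_a"], sources["source_b"],
                       use_length=False)
by_name_and_length = overlap_keys(sources["source_a"], sources["source_b"],
                                  use_length=True)
print(len(by_name), len(by_name_and_length))
```

Requiring the length to match as well is the stricter criterion, so its overlap can never exceed the overlap by name alone.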

**Figure 4.**
Overview of the ViruSurf interface. Part 1 (Top bar) allows resetting the previously chosen query or selecting predefined example queries. Queries are composed by using Part 2 (Metadata search) and Part 3 (Variants search). In our example, Part 2 includes three filters on Virus taxon name, Is complete and N%. Part 3 includes three panels. Panel ‘A’ is a query on amino acid variants, selecting sequences with RK and GR changes in gene N; Panel ‘B’ is a query on nucleotide variants, selecting sequences with a variant at position 28 881. Panels ‘A’ and ‘B’ are closed: they can be removed but not changed. Panel ‘C’ is another query on amino acid variants, currently open; it includes two filters selecting given positions of the Spike protein, and visualizes available values for the original amino acid involved in the change. Part 4 shows the Result Visualization. Resulting sequences already reflect the filters of Part 2 and the queries of the closed panels ‘A’ and ‘B’ of Part 3, applied in conjunction. Results can be downloaded in CSV or FASTA format; they can be selected as either cases (default) or controls (switch), and both the nucleotide and amino acid sequences can be projected on a specific protein; table columns can be omitted and reordered. The bottom right corner shows the number of sequences resulting from the search (the figure shows only three of the 14 sequences found).
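The way metadata filters and closed variant panels combine in conjunction can be approximated with the Python sketch below. It is loosely modeled on the example in this caption and is not the ViruSurf implementation: the classes, the helper functions `aa_panel` and `nuc_panel`, the N% threshold and all sample sequences are hypothetical. A panel is assumed to be satisfied when at least one variant of the sequence meets all of the panel's conditions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class AAVariant:
    protein: str
    position: int
    original: str
    alternative: str


@dataclass
class Sequence:
    accession: str
    is_complete: bool
    n_percentage: float
    aa_variants: List[AAVariant]
    nuc_variant_positions: List[int]


Predicate = Callable[[Sequence], bool]


def aa_panel(protein: str, original: str, alternative: str) -> Predicate:
    """Panel satisfied if some amino acid variant meets all three conditions."""
    return lambda s: any(
        v.protein == protein and v.original == original
        and v.alternative == alternative
        for v in s.aa_variants
    )


def nuc_panel(position: int) -> Predicate:
    """Panel satisfied if the sequence has a nucleotide variant at position."""
    return lambda s: position in s.nuc_variant_positions


def run_query(seqs: List[Sequence], filters: List[Predicate],
              panels: List[Predicate]) -> List[Sequence]:
    """Apply metadata filters and variant panels in conjunction."""
    return [s for s in seqs
            if all(f(s) for f in filters) and all(p(s) for p in panels)]


# Invented sample data, for illustration only.
r_to_k = AAVariant("N", 203, "R", "K")
seqs = [
    Sequence("SEQ-1", True, 0.2, [r_to_k], [28881]),
    Sequence("SEQ-2", True, 0.1, [r_to_k], []),       # fails nucleotide panel
    Sequence("SEQ-3", False, 0.0, [r_to_k], [28881]),  # not complete
    Sequence("SEQ-4", True, 5.0, [r_to_k], [28881]),   # too many unknown bases
]

result = run_query(
    seqs,
    filters=[lambda s: s.is_complete,
             lambda s: s.n_percentage < 1.0],  # threshold is an assumption
    panels=[aa_panel("N", "R", "K"), nuc_panel(28881)],
)
print([s.accession for s in result])
```

Keeping each panel as a separate predicate mirrors the interface behavior described in the caption, where a closed panel can be removed as a unit but not edited.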

Grants and funding

693174/ERC_/European Research Council/International
