Sci Data. 2022 Jun 1;9(1):260.
doi: 10.1038/s41597-022-01348-9.

CoV2K model, a comprehensive representation of SARS-CoV-2 knowledge and data interplay

Tommaso Alfonsi et al. Sci Data. 2022.

Abstract

Since the outbreak of the COVID-19 pandemic, many research organizations have studied the genome of the SARS-CoV-2 virus, and a body of public resources has been published for monitoring its evolution. While we experience an unprecedented richness of information in this domain, we have also ascertained the presence of several information quality issues. We hereby propose CoV2K, an abstract model for explaining SARS-CoV-2-related concepts and interactions, focusing on viral mutations, their co-occurrence within variants, and their effects. CoV2K provides a clear and concise route map for understanding different connected types of information related to the virus; it thus drives a process of data and knowledge integration that aggregates information from several current resources, harmonizing their content and overcoming incompleteness and inconsistency issues. CoV2K is available for exploration as a graph that can be queried through a RESTful API addressing single entities or paths through their relationships. Practical use cases demonstrate its application to current knowledge inquiries.


Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Different Namings, classes, and Contexts (i.e., characterizing mutations) given to the best-known WHO-named variants, as available on June 18th, 2021. Information heterogeneity is highlighted in yellow; sources report different groupings, classes, names, and mutation characterizations.
Fig. 2
CoV2K abstract model. The areas to the left of the dashed vertical line represent knowledge about SARS-CoV-2, whereas the areas to the right contain its data. Within each area, entities are represented by white rectangles. Within and across areas, entities are connected by relationships of five kinds: i) directed edges between two entities represent functional relationships, e.g., each Context may refer to only one Variant; ii) a directed edge marked with cardinality “2” denotes a double functional relationship (one AA residue change involves two AA residues); iii) directed edges connecting one (parent) entity to many (child) entities represent a generalization hierarchy, e.g., Effects are either Variant effects, Change effects, or Group effects; iv) undirected edges represent many-to-many relationships, e.g., each Variant may refer to many Variant effects and each Variant effect may refer to many Variants; v) dashed lines represent knowledge-data connections. The names of functional relationships are read along the direction of the arrow; for the other relationships, the direction is clear from the context.
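The relationship kinds listed in this caption can be illustrated with a minimal Python sketch. This is not the CoV2K schema: all class and field names below (`Context.owner`, `Variant.effects`, `link`, and so on) are assumptions chosen for illustration, and only the relationship kinds themselves (functional, double functional, generalization, many-to-many) come from the caption.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(eq=False)
class Variant:
    name: str
    # Many-to-many side: a Variant may refer to many Variant effects.
    effects: List["VariantEffect"] = field(default_factory=list, repr=False)


@dataclass
class Context:
    owner: str        # hypothetical field: the organization assigning the context
    variant: Variant  # functional relationship: exactly one Variant per Context


@dataclass(eq=False)
class Effect:
    description: str  # hypothetical field


# Generalization hierarchy: an Effect is a Variant, Change, or Group effect.
@dataclass(eq=False)
class VariantEffect(Effect):
    # Many-to-many side: a Variant effect may refer to many Variants.
    variants: List[Variant] = field(default_factory=list, repr=False)


@dataclass(eq=False)
class ChangeEffect(Effect):
    pass


@dataclass(eq=False)
class GroupEffect(Effect):
    pass


@dataclass
class AAResidue:
    name: str


@dataclass
class AAResidueChange:
    # Double functional relationship (cardinality "2"): exactly two residues.
    residues: Tuple[AAResidue, AAResidue]


def link(variant: Variant, effect: VariantEffect) -> None:
    """Record a many-to-many association on both sides (hypothetical helper)."""
    variant.effects.append(effect)
    effect.variants.append(variant)
```

A functional relationship becomes a single-valued field, a many-to-many relationship becomes a pair of lists kept in sync by `link`, and the generalization hierarchy becomes subclassing. Identity-based equality (`eq=False`) and `repr=False` are used on the mutually referencing classes so that comparing or printing them does not recurse.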
Fig. 3
A representative instance of CoV2K, highlighting a few illustrative concepts and connections. The example refers to a variant identified as V1 and best known as Alpha (the name assigned by the WHO); several alternative names are given by other organizations (red labels). The variant is associated with contexts C1–C4, each assigned by a different organization. Each context includes several amino acid positional changes. Context C2, provided by ECDC, includes only the three most representative changes on the Spike protein. Context C1 includes 24 amino acid changes, most of which are omitted in the figure. The example shows overlaps between representative amino acid changes. All changes are linked to their protein regions, possibly through their sub-regions (e.g., RBD). Effects are linked to variants, to groups of changes, or to individual changes; they are labeled with their evidence source, either an organization or a publication (red labels). Finally, the P-H change links to the Proline and Histidine residues. Bold lines highlight one of the paths captured by the query in Use Case 1.
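To make the idea of a path through such an instance concrete, the following Python sketch builds a tiny in-memory graph from a few of the elements named in this caption and enumerates the simple paths between two nodes. It does not use or imitate the CoV2K RESTful API. The node labels, the helpers `add_edge` and `simple_paths`, and in particular the edge placing the P-H change in context C1 are assumptions made for illustration; the caption does not say which context contains that change.

```python
from typing import Dict, List, Optional

Graph = Dict[str, List[str]]


def add_edge(graph: Graph, a: str, b: str) -> None:
    """Add an undirected edge so that paths can be walked in either direction."""
    graph.setdefault(a, []).append(b)
    graph.setdefault(b, []).append(a)


def simple_paths(graph: Graph, start: str, goal: str,
                 path: Optional[List[str]] = None) -> List[List[str]]:
    """Depth-first enumeration of all paths that never revisit a node."""
    path = (path or []) + [start]
    if start == goal:
        return [path]
    found: List[List[str]] = []
    for nxt in graph.get(start, []):
        if nxt not in path:
            found.extend(simple_paths(graph, nxt, goal, path))
    return found


graph: Graph = {}
# Variant V1 (Alpha) is associated with contexts C1-C4, as in the caption.
for ctx in ("C1", "C2", "C3", "C4"):
    add_edge(graph, "Variant:V1", f"Context:{ctx}")
# ASSUMPTION for illustration only: the P-H change is placed in context C1.
add_edge(graph, "Context:C1", "Change:P-H")
# The P-H change links to the Proline and Histidine residues (from the caption).
add_edge(graph, "Change:P-H", "Residue:Proline")
add_edge(graph, "Change:P-H", "Residue:Histidine")

paths = simple_paths(graph, "Variant:V1", "Residue:Histidine")
```

With the edges above, `paths` holds a single route that goes from the variant through one context and one change to a residue, which is the general shape of a path through entity relationships.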
Fig. 4
Data integration pipeline. For each area, we show the sources from which information is extracted, transformed, and loaded into a MongoDB instance (Knowledge areas) or a PostgreSQL instance (Data areas). Several harmonization modules are applied to the Knowledge parts, which are then ready to be queried through the RESTful API.
Fig. 5
Examples of record resolution regarding the Alpha variant, the Epsilon variant, and a specific lineage spread in the United States during Summer 2020 (called Pelican in CoVariants.org).
