Heliyon. 2025 Feb 12;11(4):e42613.
doi: 10.1016/j.heliyon.2025.e42613. eCollection 2025 Feb 28.

SGV-caller: SARS-CoV-2 genome variation caller


Jiaqi Wu et al. Heliyon. 2025.

Abstract

Given the pandemic caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), continuous analysis of its genomic variations at the nucleotide level is imperative to monitor the emergence of novel variants of concern. The Global Initiative on Sharing All Influenza Data (GISAID) serves as the de facto standard database for the genomic information of SARS-CoV-2. However, limitations of its data-sharing policy hinder the comprehensive analysis of genomic variations. To address this problem, we developed SGV-caller, a bioinformatics pipeline for analyzing the frequently updated GISAID database. SGV-caller compares input datasets with pre-existing databases and generates local databases encompassing nucleotide, amino acid, and codon-level genomic variations for each SARS-CoV-2 genome. Furthermore, SGV-caller accommodates SARS-CoV-2 genomes from non-GISAID sources as well as other viral genomes. SGV-caller source code and test data are available at https://github.com/wujiaqi06/SGV-caller.
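The comparison of an input dataset against a pre-existing database can be pictured as a set difference over genome identifiers. The following Python snippet is a minimal sketch of that idea only, not the authors' implementation (the pipeline's scripts are written in Perl). The function name `newly_added_ids` and the accession values are hypothetical and chosen for illustration.

```python
def newly_added_ids(metadata_ids, database_ids):
    """Return IDs found in the input metadata but missing from the local database.

    Input order is preserved and duplicate IDs are reported once.
    Hypothetical helper: the real pipeline reads these IDs from files.
    """
    known = set(database_ids)
    seen = set()
    result = []
    for accession in metadata_ids:
        if accession not in known and accession not in seen:
            seen.add(accession)
            result.append(accession)
    return result


# Made-up accession IDs, for illustration only.
incoming = ["EPI_ISL_1", "EPI_ISL_2", "EPI_ISL_3", "EPI_ISL_2"]
already_processed = ["EPI_ISL_1"]
todo = newly_added_ids(incoming, already_processed)
```

Under these assumptions, only the genomes listed in `todo` would need to be aligned and analyzed on the next run, which is what makes repeated processing of a frequently updated database affordable.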

Keywords: Bioinformatics pipeline; GISAID; Genome surveillance; Mutations; Viral genomes.

Conflict of interest statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figures

Fig. 1
A schematic workflow of SGV-caller for Pipelines #1 and #2. Two major calculation pipelines, Pipeline #1 (starting from step 1) and Pipeline #2 (starting from step 1∗), are shown as examples. Input and output files are shown in white and green boxes, respectively. The gray boxes indicate the script used in each step (script numbers are listed in Table 1). Each dotted box with a bold number indicates the order of the data flow for that calculation step. An asterisk (∗) indicates an alternative workflow for the step. For example, in step 1, Script #1 “name2ID.pl” generates a “Target ID list” file (i.e., a “name2ID.txt” file) using GISAID metadata. Alternatively, in step 1∗, Script #2 “newly_aded_name2ID.pl” generates a “Target ID list” file (i.e., a “NewlyAdded.name2ID.txt” file) based on an existing SGV database. This file contains the GISAID IDs that are not found in the given SGV database. In step 2, based on the “Target ID list” file, SGV-caller performs a pairwise alignment between the reference genome and each target genome and reports a raw variation file for each genome. Step 3∗ calculates the quality of each genome sequence. The result includes the number of undetermined nucleotides in each target genome and in its spike protein gene, and the number of nucleotides that differ from the reference. Step 3 screens the raw variation file for each genome and removes all genomic variations that are due only to ambiguous nucleotides. Step 4 maps nucleotide-level variations to codons and amino acids based on the annotation file.
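Steps 2 to 4 of the workflow can be illustrated with a small Python sketch. It is not the authors' code: it assumes the pairwise alignment has already been produced as two equal-length strings with `-` for gaps, it handles single-nucleotide substitutions inside one coding region only, and the names `raw_variations`, `drop_ambiguous`, `annotate_substitution`, and `cds_start` are hypothetical. The sequences are toy data.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def raw_variations(ref_aln, tgt_aln):
    """Step 2 (sketch): list (position, ref_base, target_base) for mismatched columns.

    Positions are 1-based on the ungapped reference. Columns where the
    reference has a gap (insertions in the target) are skipped for simplicity.
    """
    variations = []
    pos = 0
    for r, t in zip(ref_aln.upper(), tgt_aln.upper()):
        if r == "-":
            continue
        pos += 1
        if r != t:
            variations.append((pos, r, t))
    return variations


def drop_ambiguous(variations):
    """Step 3 (sketch): keep only variations whose target base is A, C, G, T or a gap."""
    return [v for v in variations if v[2] in "ACGT-"]


def annotate_substitution(ref_seq, variation, cds_start):
    """Step 4 (sketch): map one substitution to its codon and amino acid change.

    cds_start is the 1-based reference position of the first base of the
    coding region; the variation is assumed to lie inside that region.
    """
    pos, _ref_base, alt_base = variation
    offset = pos - cds_start                    # 0-based offset within the CDS
    codon_index = offset // 3                   # 0-based codon number
    start = cds_start - 1 + codon_index * 3     # 0-based index into ref_seq
    ref_codon = ref_seq[start:start + 3]
    within = offset % 3
    alt_codon = ref_codon[:within] + alt_base + ref_codon[within + 1:]
    label = f"{CODON_TABLE[ref_codon]}{codon_index + 1}{CODON_TABLE[alt_codon]}"
    return label, ref_codon, alt_codon


# Toy data: the target differs at position 5 (A to G) and has an N at position 7.
reference = "ATGGATTTT"
target = "ATGGGTNTT"
raw = raw_variations(reference, target)
clean = drop_ambiguous(raw)
annotated = [annotate_substitution(reference, v, cds_start=1) for v in clean]
```

In this toy case the N at position 7 is discarded in the filtering step, and the remaining substitution changes the second codon from GAT to GGT, an aspartate-to-glycine change. A full implementation would also need to handle insertions, deletions, and multiple coding regions read from the annotation file.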
