Viruses. 2024 Mar 11;16(3):430.
doi: 10.3390/v16030430.

Recommendations for Uniform Variant Calling of SARS-CoV-2 Genome Sequence across Bioinformatic Workflows

Ryan Connor et al. Viruses.

Abstract

Genomic sequencing of clinical samples to identify emerging variants of SARS-CoV-2 has been a key public health tool for curbing the spread of the virus. As a result, an unprecedented number of SARS-CoV-2 genomes were sequenced during the COVID-19 pandemic, which allowed for rapid identification of genetic variants, enabling the timely design and testing of therapies and deployment of new vaccine formulations to combat the new variants. However, despite the technological advances of deep sequencing, the analysis of the raw sequence data generated globally is neither standardized nor consistent, leading to vastly disparate sequences that may impact identification of variants. Here, we show that for both Illumina and Oxford Nanopore sequencing platforms, downstream bioinformatic protocols used by industry, government, and academic groups resulted in different virus sequences from the same sample. These bioinformatic workflows produced consensus genomes that differed in single nucleotide polymorphisms and in the inclusion or exclusion of insertions and/or deletions, despite using the same raw sequence data as input. We compare and characterize these discrepancies and propose a specific suite of parameters and protocols that should be adopted across the field. Consistent results from bioinformatic workflows are fundamental to SARS-CoV-2 and future pathogen surveillance efforts, including pandemic preparation, to allow for a data-driven and timely public health response.

Keywords: SARS-CoV-2; variant calling.

Conflict of interest statement

R.M., J.L., and E.S.M. are employees of, and stockholders in, Gilead Sciences, Inc. J.d.I and L.A.P. are employees of Vir Biotechnology, Inc. and may hold shares in Vir Biotechnology, Inc. L.A.P. is a former employee and shareholder of Regeneron Pharmaceuticals and is a member of the Scientific Advisory Board of the AI-driven Structure-enabled Antiviral Platform (ASAP). P.E. is an employee of, and holds stock or stock options in, Eli Lilly and Company. Courtney Copeland and Tré LaRosa are employees of Deloitte Consulting LLP and have indicated they have no conflicts of interest relevant to this article to disclose. The remaining authors declare no conflicts of interest.

Figures

Figure 1
Flow chart of the platforms from each participating organization’s workflows at the time of analysis. Shown are the schematics for (A) Illumina platform variant calling and (B) Oxford Nanopore Technologies (ONT) variant calling. For each sequencing platform, the main steps of variant calling are captured in each box, including: read retrieval, host removal, read trimming, alignment, variant calling, variant filtering, and variant normalization. For each step, the software used by each workflow is noted.
Figure 2
The impact of host contamination removal and primer trimming. (A) The removal of host reads from an RNAseq SARS-CoV-2 sequencing result, SRA run SRR12245095, reduced the potential for false-positive variant calls. In the top panel, additional mutations were present in aligned reads between positions 3049–3076 of NC_045512 when host reads were not removed. After excluding host reads (bottom panel), reads containing the mutations were no longer observed. (B) Allele frequencies of variants called after trimming primer sequences from aligned reads (corrected allele frequencies) are plotted against allele frequencies of the same variants called without primer trimming (uncorrected allele frequencies). Primer trimming increases the allele frequencies of most variants that lie within primer-binding sites. Blue lines represent the allele frequency (AF) thresholds used in this study to filter variant calls (AF ≥ 0.15) and to call consensus variants (AF ≥ 0.5).
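The two allele-frequency thresholds in this caption can be illustrated with a minimal Python sketch. Only the cut-off values (0.15 and 0.5) come from the caption; the `VariantCall` record, the `classify` function, and the labels "consensus", "minor", and "filtered" are hypothetical names chosen for illustration and are not taken from any of the workflows in the study.

```python
from dataclasses import dataclass

# Thresholds as stated in the Figure 2 caption.
MIN_AF_FILTER = 0.15     # calls below this allele frequency are discarded
MIN_AF_CONSENSUS = 0.5   # calls at or above this are consensus variants


@dataclass(frozen=True)
class VariantCall:
    """Hypothetical record for one variant call against the reference."""
    pos: int          # 1-based reference position
    ref: str
    alt: str
    alt_depth: int    # reads supporting the alternate allele
    total_depth: int  # all reads covering the position

    @property
    def allele_frequency(self) -> float:
        # Guard against zero coverage rather than dividing by zero.
        return self.alt_depth / self.total_depth if self.total_depth else 0.0


def classify(call: VariantCall) -> str:
    """Label a call using the two AF thresholds (labels are illustrative)."""
    af = call.allele_frequency
    if af >= MIN_AF_CONSENSUS:
        return "consensus"
    if af >= MIN_AF_FILTER:
        return "minor"
    return "filtered"
```

Because primer trimming tends to raise the allele frequency of variants within primer-binding sites, the same call could move from "filtered" or "minor" to a higher category once primers are trimmed, which is why the order of trimming and filtering matters.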
Figure 3
The effect of Alternate Allele Depth and Alternate Allele Frequency on variant calling agreement across workflows and platforms. For each panel, calls made by all but one workflow (A–D) or both platforms (E–H) were considered true-positives, while calls made by only a single workflow (or technology) were considered false-positives; thus, the ROC AUCs cannot be directly compared between groups. For the right panels, points represent an Allele Frequency (AF) cut-off of 1 at the lower-leftmost point, and the cut-off decreases by 0.1 along the length of the line. For the left panels, the points represent a minimum Alternate Allele Depth (AltDP) going from 4,000 at the lower-leftmost point to 10 along each line. (A,B) Impact of AltDP and AF, respectively, on Illumina workflow accuracy and specificity across workflows. (C,D) Impact of AltDP and AF, respectively, on Illumina workflow accuracy and specificity across platforms. (E,F) Impact of AltDP and AF, respectively, on ONT workflow accuracy and specificity across workflows. (G,H) Impact of AltDP and AF, respectively, on ONT workflow accuracy and specificity across platforms.
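The agreement-based labeling and the AF cut-off sweep described in this caption can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' code: the function names, the `(af, n_supporting)` input format, and the decision to leave out calls supported by an intermediate number of workflows are all assumptions, and the sketch assumes at least three workflows.

```python
def label_call(n_supporting, n_workflows):
    """Agreement-based label: True, False, or None (left unlabeled).

    Assumes n_workflows >= 3 so the two categories do not overlap.
    """
    if n_supporting >= n_workflows - 1:
        return True   # called by all, or all but one, workflow
    if n_supporting == 1:
        return False  # called by a single workflow only
    return None       # assumption: intermediate support is excluded


def roc_points(calls, n_workflows, steps=10):
    """Sweep the AF cut-off from 1.0 down to 0.0 in equal steps.

    `calls` is an iterable of (af, n_supporting) pairs. Returns a list
    of (cutoff, false_positive_rate, true_positive_rate) tuples.
    """
    labeled = [(af, label_call(n, n_workflows)) for af, n in calls]
    pos = [af for af, label in labeled if label is True]
    neg = [af for af, label in labeled if label is False]
    points = []
    for i in range(steps, -1, -1):
        cutoff = i / steps
        # Fraction of each class that survives the current cut-off.
        tpr = sum(af >= cutoff for af in pos) / len(pos) if pos else 0.0
        fpr = sum(af >= cutoff for af in neg) / len(neg) if neg else 0.0
        points.append((cutoff, fpr, tpr))
    return points
```

The same sweep could be run over a minimum alternate allele depth instead of AF by swapping the thresholded quantity. Because the labels come from agreement rather than a ground truth, the resulting curves describe consistency between workflows, which is why the caption cautions against comparing AUCs between groups.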
Figure 4
Agreement across workflows with and without recommended parameters. (A–D) Agreement across workflows, without recommended parameters. (E–H) Agreement across workflows, with recommended parameters. (A,E) Agreement on Illumina SNP calls. (B,F) Agreement on Illumina InDel calls. (C,G) Agreement on Oxford Nanopore (ONT) SNP calls. (D,H) Agreement on ONT InDel calls. For each figure, the bars indicate the number of variants called by the groups, indicated by filled circles below, across the whole dataset.
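The bar heights in this kind of plot are counts of variants grouped by the exact set of workflows that called them. A minimal Python sketch of that tally is shown below; the function name `agreement_counts`, the input layout (workflow name mapped to a set of variant keys), and the workflow names in the usage note are hypothetical and not taken from the study.

```python
from collections import Counter, defaultdict


def agreement_counts(calls_by_workflow):
    """Count variants by the exact combination of workflows calling them.

    `calls_by_workflow` maps a workflow name to a set of hashable variant
    keys, e.g. (position, ref, alt) tuples. Returns a Counter keyed by
    frozenset of workflow names.
    """
    callers = defaultdict(set)
    for workflow, variants in calls_by_workflow.items():
        for variant in variants:
            callers[variant].add(workflow)
    # Each variant contributes one count to its caller combination.
    return Counter(frozenset(names) for names in callers.values())
```

For example, with three illustrative workflows `wf_a`, `wf_b`, and `wf_c`, a variant called by all three adds one to the bar for `{wf_a, wf_b, wf_c}`, while a variant called only by `wf_c` adds one to the bar for `{wf_c}`. Higher agreement shows up as more of the total count concentrated in the all-workflows combination.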
Figure 5
Application of recommended parameters results in increased agreement across platforms. Graphical representation of the agreement between platforms without the application of recommended parameters for SNP (A) and InDel (B) calls. (C) (SNP) and (D) (InDel) represent the agreement between platforms after the application of the recommended parameters. For each figure, only those samples for which both Illumina and ONT platform data had at least one variant call that passed all of the filters were considered. The total height is normalized to the total number of calls made by each workflow, with the light blue portion indicating calls made on both platforms for a given sample, medium blue indicating calls made only for Illumina data, and dark blue indicating calls made only for ONT data.
Figure 6
Variant calling workflow recommendations. The recommendations for each step in a variant calling workflow, from read cleanup to variant filtering, are outlined. The benefits of implementing the recommendations at each step are also noted.
