Benchmark datasets for SARS-CoV-2 surveillance bioinformatics
- PMID: 36093336
- PMCID: PMC9454940
- DOI: 10.7717/peerj.13821
Abstract
Background: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the cause of coronavirus disease 2019 (COVID-19), has spread globally and is being surveilled through an international genome sequencing effort. Surveillance consists of sample acquisition, library preparation, and whole genome sequencing. This has necessitated a classification scheme detailing Variants of Concern (VOC) and Variants of Interest (VOI), and has driven the rapid expansion of bioinformatics tools for sequence analysis. These tools serve several actionable purposes: maintaining quality assurance and quality checks, defining population structure, performing genomic epidemiology, and inferring lineage so that variants can be identified and classified reliably. Additionally, the pandemic has required public health laboratories to rapidly reach high-throughput proficiency in sequencing library preparation and downstream data analysis. However, both processes can be limited by the lack of a standardized sequence dataset.
Methods: We identified six SARS-CoV-2 sequence datasets from recent publications, public databases, and internal resources. In addition, we created a method to mine public databases for representative genomes to include in these datasets. Using this novel method, we identified several genomes as either VOI/VOC representatives or non-VOI/VOC representatives. To describe each dataset, we used a previously published dataset format, which records accession information and dataset-level information. Additionally, we enhanced a script from the same publication to download and verify all data from this study.
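The download-and-verify step, together with the sha256 keyword, suggests checksum validation of each retrieved file. The following is a minimal sketch of that general idea, not the authors' script: the function names `sha256_of_file` and `verify_download` are hypothetical, and the expected digest is assumed to come from the dataset description.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks
    so that large sequence files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256):
    """Return True if the file's digest matches the expected value.

    The expected value is normalized (whitespace stripped, lowercased)
    because checksum listings vary in formatting.
    """
    return sha256_of_file(path) == expected_sha256.strip().lower()
```

A caller would pass the path of a downloaded read or assembly file and the checksum listed for it, and treat a `False` result as a corrupted or incomplete download.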
Results: The benchmark datasets focus on the two most widely used sequencing platforms: long read sequencing data from the Oxford Nanopore Technologies platform and short read sequencing data from the Illumina platform. There are six datasets: three were derived from recent publications; two were derived from data mining public databases to answer common questions not covered by published datasets; one unique dataset representing common sequence failures was obtained by rigorously scrutinizing data that did not pass quality checks. The dataset summary table, data mining script and quality control (QC) values for all sequence data are publicly available on GitHub: https://github.com/CDCgov/datasets-sars-cov-2.
Discussion: The datasets presented here were generated to help public health laboratories build sequencing and bioinformatics capacity, benchmark different workflows and pipelines, and calibrate QC thresholds to ensure sequencing quality. Together, improvements in these areas support accurate and timely outbreak investigation and surveillance, providing actionable data for pandemic management. Furthermore, these publicly available and standardized benchmark data will facilitate the development and adjudication of new pipelines.
Keywords: Benchmarking; COVID-19; Standardization; WGS; sha256.
©2022 Xiaoli et al.
Conflict of interest statement
The authors declare there are no competing interests.
Grants and funding
- U19 AI110818/AI/NIAID NIH HHS/United States
- BBS/E/F/000PR10352/BB_/Biotechnology and Biological Sciences Research Council/United Kingdom
- BB/CCG1860/1/BB_/Biotechnology and Biological Sciences Research Council/United Kingdom
- BB/R012504/1/BB_/Biotechnology and Biological Sciences Research Council/United Kingdom