Per Med. 2023 Jan;20(1):13-25.
doi: 10.2217/pme-2022-0013. Epub 2023 Mar 28.

SARS-CoV-2 variant identification using a genome tiling array and genotyping probes

Ryota Shimada et al. Per Med. 2023 Jan.

Abstract

With over 5.5 million deaths worldwide attributed to the respiratory disease COVID-19, caused by the novel coronavirus SARS-CoV-2, it is essential that continued efforts be made to track the evolution and spread of the virus globally. The authors previously presented a rapid and cost-effective method to sequence the entire SARS-CoV-2 genome with 95% coverage and 99.9% accuracy. This method is advantageous for identifying and tracking variants in the SARS-CoV-2 genome compared with traditional short-read sequencing methods, which can be time-consuming and costly. Herein, the addition of genotyping probes that target known SARS-CoV-2 variants to the DNA chip is presented. The incorporation of genotyping probe sets, along with the introduction of a moving average filter, improved the sequencing coverage and accuracy of the SARS-CoV-2 genome.

Keywords: COVID-19; SARS-CoV-2; bioinformatics; genotyping; pandemic; screening; sequencing; tiling-array; viral genome.

Plain language summary

Throughout the COVID-19 pandemic, the virus known as SARS-CoV-2 has continued to mutate and evolve. It is imperative to continue to track these mutations, and where the virus has traveled, in order to best inform healthcare practices and global strategies to combat the virus. The authors previously developed a method to investigate 95% of this viral genome with 99.9% accuracy that was more cost-effective and less time-consuming than previous methods. In this work, specific markers were added to the technology to allow tracking of mutations in the virus that have already been documented. In doing so, both the accuracy and the fraction of the viral genome that can be sequenced were improved.

Conflict of interest statement

The authors declare the following competing financial interest(s): The Centrillion-affiliated authors and J.S.E. are employees of the company, and the company is commercializing the work described herein.

Figures

Figure 1
Overview of tiling array protocol and bioinformatics analysis strategy; a. Overview of sample preparation and handling. Samples are converted from RNA to cDNA using ligase, followed by amplification with biotin-11-dUTP, fragmentation, hybridization to the tiling array, staining with Cy3 streptavidin, and imaging using a custom confocal scanner; b. Overview of base calling algorithm. Raw intensity measurements are ranked in decreasing order for each position probed. The ranked values are used to calculate the difference (d) and differential (D) for each position. A machine learning model built on these parameters assigns the base calls. Resequenced data are aligned with the reference genome and used for further analysis; c. Replacement iteration 1 looks at all variant calls. The variant call made by the genotyping probeset may or may not match the call made by the genome tiling array, so the Q-scores of the two reads are compared. If the genotyping probeset confirms the variant call (i.e., genome tiling array = genotyping array), the variant is confirmed and the Q-score is replaced. If the variant calls of the two reads differ, the read with the higher Q-score is used to make the base call. All known variants at the time of array fabrication are accounted for; variants called by the genotyping array are automatically at a ‘site of known variant’. If a call is a non-call (i.e., Q-score < 20) and the genotyping probeset Q-score is > 20, the call is automatically replaced; d. Replacement iteration 2 looks at all calls after replacement iteration 1. When the genotyping probeset makes a higher-quality base call than the genome tiling array equivalent, the Q-score is replaced. The genotyping probeset can make correct reference or variant calls. If the Q-score from the genotyping probeset is higher and it confirms the reference or variant, the base call and Q-score from the genotyping probeset are used. This raises the overall quality and confidence of the calls. ‘Confirmation’ refers to whether the genotyping probeset call matches the reference or the variant call made by the genome tiling array. ‘Non-confirmed’ indicates calls that do not match the reference or a known variant call; e. The MA-filter is constructed using the final Q-scores after replacement iterations 1 and 2. For each base call, the Q-scores of the 12 reads before and after are averaged to determine the MAQ at that position. The cutoff is the average of all MAQ values minus 2× the standard deviation. All calls with a Q-score below 30 are entered into the filter. Of these intermediate-quality calls, those with a MAQ below the cutoff are designated non-calls and the base call is not used.
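The two replacement iterations in panels c and d can be read as a small set of per-position rules. The Python sketch below is one possible reading of the caption, not the authors' implementation. The names `Call`, `replacement_iteration_1`, `replacement_iteration_2` and `accepted_bases` are hypothetical, and two points are assumptions: a confirmed variant keeps the higher of the two Q-scores, and iteration 1 is triggered whenever either source reports a non-reference base or the tiling array call is a non-call.

```python
from dataclasses import dataclass
from typing import Optional

Q_THRESHOLD = 20  # calls with Q below this are treated as non-calls (from the caption)


@dataclass(frozen=True)
class Call:
    base: str   # called base: "A", "C", "G" or "T"
    q: float    # quality score of the call


def replacement_iteration_1(tiling: Call, geno: Optional[Call], reference_base: str) -> Call:
    """Iteration 1: resolve variant calls and rescue non-calls at one position."""
    if geno is None:                      # no genotyping probe set at this position
        return tiling
    if tiling.q < Q_THRESHOLD:            # tiling array non-call
        return geno if geno.q > Q_THRESHOLD else tiling
    if tiling.base == reference_base and geno.base == reference_base:
        return tiling                     # no variant call involved; left for iteration 2
    if geno.base == tiling.base:
        # Confirmed variant. Assumption: the Q-score is replaced by the higher value.
        return Call(tiling.base, max(tiling.q, geno.q))
    # The two calls disagree: the read with the higher Q-score makes the base call.
    return geno if geno.q > tiling.q else tiling


def replacement_iteration_2(current: Call, geno: Optional[Call], accepted_bases: set[str]) -> Call:
    """Iteration 2: adopt a higher-quality genotyping call that confirms the
    reference base or a known variant base (accepted_bases)."""
    if geno is not None and geno.q > current.q and geno.base in accepted_bases:
        return geno
    return current
```

For example, under these assumptions a tiling array variant call `Call("T", 25)` confirmed by a genotyping call `Call("T", 38)` at a reference `C` site becomes `Call("T", 38)`.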
Figure 2
Representative schematic of tiling array design and probe design for the genome tiling array and genotyping probe set; a. A small section of the genome tiling array. Each square is a feature covered with a specific 25mer oligonucleotide probe for a specific position in the target genome. A probe set is a grouping of four features, one for each potential base. Highlighted in red are the probe sets for positions 28122 and 28133. Probe sets are roughly organized in the order of base pair position; b. Conceptual representation of tiling array probes for positions 28132, 28133 and 28134. Each feature consists of a single 25mer probe with one of four possible bases at the 13th position, indicated by the green box, which is the resequenced position. The base highlighted in blue is the reference base call. The others, indicated in red, are potential variants. The flanking 12 bp on either side exactly match the reference base calls; c. Representative section of genotyping probe set design. The layout design, organization of probe sets, and 25mer design concept are identical to those of the genome tiling array. Highlighted in red are features within the probe set for position 241, a site of known variant; d. Conceptual representation depicting the difference between probe sets for reference 241 (top) and variant 241* (bottom). Bases highlighted by the green box at the 13th position indicate the probed position. The base highlighted in blue is the reference base call. For position 241, the reference base call is C, while in 241* the reference base call is T. All other positions of the 25mer probes exactly match the reference base calls.
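The probe set layout in panel b (four 25mers per position, differing only at the 13th base) can be illustrated with a short Python sketch. It works on plain sequence strings and, as a simplifying assumption, ignores strand orientation and complementarity of the physical probes. The function name `probe_set` and the toy reference string are hypothetical.

```python
BASES = "ACGT"
FLANK = 12  # reference bases on each side of the interrogated (13th) position


def probe_set(reference: str, position: int) -> dict[str, str]:
    """Return the four 25mer sequences for a 1-based genome position,
    keyed by the base placed at the central (13th) position."""
    i = position - 1  # convert to 0-based index
    if i < FLANK or i + FLANK >= len(reference):
        raise ValueError("position is too close to a sequence end for a full 25mer")
    left = reference[i - FLANK:i]
    right = reference[i + 1:i + 1 + FLANK]
    return {base: left + base + right for base in BASES}


# Toy reference, not SARS-CoV-2 sequence.
toy_reference = "ACGT" * 10
probes = probe_set(toy_reference, 20)
```

In this toy case the probe keyed by the reference base at position 20 is identical to the 25-base window of the reference centred on that position, and the other three represent the potential variants.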
Figure 3. Scatter plot, WY64 Q score and MAQ with variant call breakdown after MA-filter.
A scatter plot, generated in R using ggplot2, displaying all reads made by the full genome tiling array, with incorporation of genotyping probe-set data and the MA-filter, on sample WY64 between positions 26 and 29834. Every call was assessed using a combination of base call and Q score. High-quality reads with Q > Qth (20) are used to make calls. Dark gray circles represent base positions where Q > Qth and a ‘Reference Call’ is made. All calculated MAQ values are overlaid in light gray. Brown dots are reads with a low Q score (Q < Qth); these are categorically ‘Non-calls’ and are excluded from base calling. ‘Variant Calls’ are shown as larger red circles, where the final base call made by the DNA chip after replacement and MA-filter is not the reference base and Q > Qth. Within ‘Variant Calls’, a blue overlap indicates calls that are removed by the MA-filter, which takes all variant calls with a Q score between 20 and 30 and removes any with a MAQ lower than the MA-threshold. Within ‘Variant Calls’, a green overlap indicates true variant calls identified and verified with short-read sequencing data. Additional colorized scatter plots for the remaining samples can be found under Supplemental Figures S1–S8.
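The MA-filter rule referred to in this caption can be sketched in a few lines of Python. This is a minimal sketch rather than the authors' code: it assumes the window spans 12 positions on each side plus the position itself, that the window is simply truncated at the ends of the sequence, and that the population standard deviation of the MAQ values is used. The function names and the synthetic Q-score list are hypothetical, and restricting the filter to variant calls is left to the caller.

```python
from statistics import fmean, pstdev


def moving_average_q(q, half_window=12):
    """Moving average of Q-scores over +/- half_window positions (truncated at the ends)."""
    n = len(q)
    return [fmean(q[max(0, i - half_window):min(n, i + half_window + 1)]) for i in range(n)]


def ma_filter(q, q_min=20, q_max=30, half_window=12):
    """Flag intermediate-quality calls (q_min <= Q < q_max) whose MAQ falls below
    mean(MAQ) - 2 * SD(MAQ). Returns (flags, cutoff)."""
    maq = moving_average_q(q, half_window)
    cutoff = fmean(maq) - 2 * pstdev(maq)
    flags = [q_min <= qi < q_max and m < cutoff for qi, m in zip(q, maq)]
    return flags, cutoff


# Synthetic example: mostly Q = 40, one isolated Q = 25 call, one low-quality stretch.
q_scores = [40] * 200
q_scores[20] = 25
q_scores[100:125] = [21] * 25
flags, cutoff = ma_filter(q_scores)
```

In this synthetic case the isolated intermediate call at index 20 sits in a high-quality neighbourhood and is kept, while the call at index 112, in the middle of the low-quality stretch, is flagged for conversion to a non-call.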
Figure 4. Venn Diagram, WY64 Variant Calls after MA-filter.
A Venn diagram depicting the categorical breakdown of calls made by the DNA chip on sample WY64. Variant Calls made by the DNA chip are contained within the red circle. Any circle or number outside of red implies that the call is a reference call. True variants are color coded in green and are confirmed by short-read sequencing data. The blue circle indicates variant calls that are filtered out by the MA-filter, converted to non-calls, and omitted from base calling. Overlap between the green and red circles indicates true variants correctly called by the DNA chip and verified through short-read sequencing data. Overlap between the green and blue circles indicates a true variant improperly removed by the MA-filter, where the read is situated in a local region of low moving averages, indicated by a MAQ below the MAQ threshold. Ideally both the green and blue circles should be contained within the red circle without overlapping each other, implying that all true variants and reads filtered out by the MA-filter were correctly identified as variant calls. Additional Venn diagrams for the remaining samples can be found under Supplemental Figures S9–S16.
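The regions of such a Venn diagram correspond to simple set operations on genome positions. The Python sketch below is illustrative only: the function name, the category labels and the toy positions are hypothetical and are not taken from the WY64 data. It assumes the red circle holds all chip variant calls before MA-filtering, as the caption's "ideal" layout suggests.

```python
def venn_breakdown(chip_variants: set[int], ma_removed: set[int], true_variants: set[int]) -> dict[str, set[int]]:
    """Split positions into the regions described for the Venn diagram.

    chip_variants: positions called as variants by the DNA chip (red circle)
    ma_removed:    positions converted to non-calls by the MA-filter (blue circle)
    true_variants: positions confirmed by short-read sequencing (green circle)
    """
    retained = chip_variants - ma_removed
    return {
        "confirmed": retained & true_variants,             # green and red overlap
        "unconfirmed": retained - true_variants,           # red only
        "wrongly_filtered": ma_removed & true_variants,    # green and blue overlap
        "filtered_not_true": ma_removed - true_variants,   # blue only
        "missed": true_variants - chip_variants,           # green outside red
    }


# Toy positions for illustration only.
regions = venn_breakdown({10, 20, 30, 40}, {30, 40}, {10, 40, 50})
```

With these toy inputs, position 40 lands in `wrongly_filtered`, the case the caption describes as a true variant improperly removed by the MA-filter.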
