Nat Genet. 2020 Oct;52(10):991-998.
doi: 10.1038/s41588-020-0700-8.

The UCSC SARS-CoV-2 Genome Browser

Jason D Fernandes et al. Nat Genet. 2020 Oct.

Abstract

Background: Researchers are generating molecular data pertaining to the SARS-CoV-2 RNA genome and its proteins at an unprecedented rate during the COVID-19 pandemic. As a result, there is a critical need for rapid and continuously updated access to the latest molecular data in a format in which all data can be quickly cross-referenced and compared. We adapted our genome browser visualization tool to the viral genome for this purpose. Molecular data, curated from published studies or from database submissions, are mapped to the viral genome and grouped together into “annotation tracks” where they can be visualized along the linear map of the viral genome sequence and programmatically downloaded in standard format for analysis.

Results: The UCSC Genome Browser for SARS-CoV-2 (https://genome.ucsc.edu/covid19.html) provides continuously updated access to the mutations in the many thousands of SARS-CoV-2 genomes deposited in GISAID and the international nucleotide sequencing databases, displayed alongside phylogenetic trees. These data are augmented with alignments of bat, pangolin, and other animal and human coronavirus genomes, including per-base evolutionary rate analysis. All available annotations are cross-referenced on the virus genome, including those from major databases (PDB, RFAM, IEDB, UniProt) as well as up-to-date individual results from preprints. Annotated data include predicted and validated immune epitopes, promising antibodies, RT-PCR and sequencing primers, CRISPR guides (from research, diagnostics, vaccines, and therapies), and points of interaction between human and viral genes. As a community resource, any user can add manual annotations which are quality checked and shared publicly on the browser the next day.

Conclusions: We invite all investigators to contribute additional data and annotations to this resource to accelerate research and development activities globally. Contact us at genome-www@soe.ucsc.edu with data suggestions or requests for support for adding data. Rapid sharing of data will accelerate SARS-CoV-2 research, especially when researchers take time to integrate their data with those from other labs on a widely used community browser platform with standardized machine-readable data formats, such as the SARS-CoV-2 Genome Browser.

Conflict of interest statement

A.S.H., H.C., J.N.G., B.T.L., L.R.N., B.J.R., K.R.R., D.S., A.S.Z., W.J.K., D.H., and M.H. receive royalties from the sale of UCSC Genome Browser source code, LiftOver, GBiB, and GBiC licenses to commercial entities. W.J.K. owns Kent Informatics.

Figures

Figure 1: A quick overview of the UCSC Genome Browser user interface structure.
Navigation controls at the top allow users to move left and right and to zoom. The Position Bar shows a highlighted red box illustrating the current portion of the genome being viewed. The Search Box allows users to search for particular features or to move to exact genomic coordinates. The RNA sequence is only shown when sufficiently zoomed in. Annotations are shown for data tracks that have been set to visible. Here, the NCBI Genes track shows the annotation of the end of the Spike (S) protein and the start of ORF3a, as well as the amino acid translation of their codons. Below that is a track showing recurrent SARS-CoV-2 variants that have been observed around the world as reported by nextstrain.org. Bar graphs show the frequency of each allele and mouseover gives the counts of each allele. The next track (Bat CoV multiz) shows a multiple alignment of 44 bat coronaviruses aligned to the reference. Overall, these viruses align well to this region of SARS-CoV-2 (the dot means the amino acid is identical), although one non-synonymous substitution (S1261P) is observed in one virus, Rs9401. The final track shows a CD8 positive epitope from IEDB (see Figure 7 for additional details). Tracks can be configured with a right-click or alternatively by clicking on their name near the bottom of the page. Only 12 of the 48 currently available track configuration buttons are shown in this figure due to space limitations. Custom data tracks generated by users can be added directly via the “add custom tracks” button. Additional options can be set via the Menu bar at the top (e.g. the “View” menu allows additional changes to the browser window). (Live interactive session for this figure: http://genome.ucsc.edu/s/SARS_CoV2/Figure1)
Figure 2: The four visibility modes of annotation tracks.
Four different ways to display the protein products of the viral genome. Shown here are the “UniProt full-length proteins” track from the UniProt track collection. “Dense” mode shows a single line highlighting any base that carries an annotation, while “squish”, “pack” and “full” expand the annotations in more detail but use more screen space. (Live Interactive Session: http://genome.ucsc.edu/s/SARS_CoV2/Figure2)
Figure 3: Molecular and Genomic Visualizations of SARS-CoV-2 Transcription on the browser.
A) SARS-CoV-2 produces mRNAs by discontinuous transcription. First replication-transcription complexes (RTCs) initiate generation of (−) RNA strand (red strand) from the positive strand (black). When RTCs encounter a body transcription regulatory sequence (TRS-B) they have a chance to “jump” to the TRS-Leader (TRS-L) via long range RNA-RNA interactions. Alternatively they can proceed as usual transcribing along the genome. The jumping process generates several different species of (−) RNA strands that lack sequence between the various TRS-B sequences and the TRS-L. These (−) strands then serve as templates for positive transcription from the TRS-L to generate a variety of viral mRNAs that produce different viral proteins. B) A simple, compact and machine-readable genomic visualization for this complex biological process. (Top) All viral mRNA species are shown as annotations on the reference genome. Black bars represent nucleotides present in an mRNA species while arrows represent the sequence that has been skipped during discontinuous transcription. Thick black bars represent the coding sequence predicted to be translated in these RNA species. (Middle) The core TRS motif, ACGAAC, annotated on the genome, corresponds to transcript junctions. (Bottom) Experimental data representing breakpoints that are fusions of TRS-B to TRS-L sequence identified by Oxford Nanopore direct RNA sequencing (Kim et al., 2020). High peaks indicate that the 5’ TRS-L sequence is found directly upstream of the annotated bases in viral RNAs. The majority of these breakpoints correlate with TRS-B motifs. (Live Interactive Session: http://genome.ucsc.edu/s/SARS_CoV2/Figure3)
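The motif annotation described in panel B can be approximated with a simple sequence scan. The following is a minimal sketch, not the browser's own method: it reports every position where the core TRS motif ACGAAC occurs in a sequence string. The function name `find_motif` and the toy sequence are hypothetical and serve only as illustration; the toy sequence is not real viral sequence.

```python
import re
from typing import List

TRS_CORE = "ACGAAC"  # core TRS motif named in the caption


def find_motif(genome: str, motif: str = TRS_CORE) -> List[int]:
    """Return 1-based start positions of every motif occurrence.

    A zero-width lookahead is used so that overlapping occurrences
    would also be reported.
    """
    pattern = re.compile("(?=" + re.escape(motif.upper()) + ")")
    return [m.start() + 1 for m in pattern.finditer(genome.upper())]


# Hypothetical toy sequence for illustration only (not real viral sequence).
toy_genome = "TTACGAACTTTGGGACGAACAA"
hits = find_motif(toy_genome)
print(hits)
```

On a real reference sequence, the reported positions could then be compared with the breakpoint peaks in the experimental track; sequence matches alone do not establish that a site is used as a TRS-B.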
Figure 4: Cleavage of viral polyproteins.
(Above) Browser track showing all annotated viral ORFs from the NCBI Genes track. Below the ORFs are the mature protein products that result from cleavage of the viral polyproteins by the viral proteases nsp3/PL-PRO and nsp5/3CL-PRO, as well as a track showing annotated cleavage sites. Also shown are two sites in the S (Spike) protein (furin_like_cleavage and fusion_peptide_cleavage) that are recognized by host cellular proteases instead of the above two viral proteases. Cleavage of coronavirus Spike protein generates mature subunits that allow the virus to enter cells. (Below) Cartoon representation showing abstractly the cleavage of polyprotein peptide sequences by the viral proteases to generate mature proteins. The viral polyproteins are cleaved by the PL-PRO protease at 3 locations that match the amino acid pattern LXGGX (X = any amino acid) as indicated; 3CL-PRO cleaves many more sites, typically at the pattern LQSAG as shown. (Live Interactive Session: http://genome.ucsc.edu/s/SARS_CoV2/Figure4)
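The LXGGX pattern given for PL-PRO can be expressed as a regular expression. The sketch below is an illustration under stated assumptions, not the annotation pipeline used for the track: it lists candidate matches in a protein string, and the function name `find_lxggx` and the toy peptide are hypothetical. A pattern match only marks a candidate; it does not show that the site is actually cleaved.

```python
import re

# LXGGX from the caption: L, any residue, G, G, any residue.
# The lookahead with a capture group reports overlapping candidates too.
LXGGX = re.compile(r"(?=(L.GG.))")


def find_lxggx(protein):
    """Return (1-based start position, matched residues) for each LXGGX match."""
    return [(m.start() + 1, m.group(1)) for m in LXGGX.finditer(protein.upper())]


# Hypothetical toy peptide for illustration only (not a real polyprotein).
toy_protein = "MKELNGGAVTRLKGGAPT"
for position, residues in find_lxggx(toy_protein):
    print(position, residues)
```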
Figure 5: Orf1a/Orf1ab Ribosomal Frameshifting.
A) Genome Browser annotations detailing translation of orf1a and orf1ab via ribosomal frameshifting. The red highlighted C is read twice by the ribosome due to the upstream poly-U tract and downstream frameshifting RNA structure annotated in the RFAM and RNA predictions track. Predicted base pairing reported by Ragan et al., 2020 and putative secondary structures are visible upon clicking on annotations. Note that tertiary interactions are not shown. (Live Interactive Session: http://genome.ucsc.edu/s/SARS_CoV2/Figure5) B) Schematic representation of ribosomal frameshifting to generate distinct protein products from ORF1a and ORF1ab. After the AAC in the A site of the ribosome recognizes its cognate tRNA, N4401 is added to the nascent peptide and the ribosome prepares to move N4401 to the P site of the ribosome and read the next codon at the A site of the ribosome. Normally this will occur canonically (above) with the ribosome moving +3 nucleotides on the mRNA (GGG) leading to addition of G4402 and normal translation of pp1a. However, occasionally (~10% of the time in SARS-CoV reporter constructs (Plant & Dinman, 2006)) the ribosome will slip due to the highly structured frameshifting element (depicted here as a simple cartoon stem loop) with the bound tRNAs slipping −1 nucleotide, and causing C13468 to remain in the A site of the ribosome. This results in +2 movement along the mRNA and an overall −1 frameshift. Therefore the next codon read is CGG and R4402 is added. Since the pp1a stop codon at 4406 is no longer in frame, ~2700 additional amino acids encoding nsp12–16 are added to the polyprotein.
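The two reading outcomes in panel B can be traced on the six nucleotides named in the caption (AAC followed by GGG). This is a minimal sketch that assumes a codon table restricted to the three codons mentioned above; the function name `read_two_codons` is hypothetical. Canonical reading advances +3 after AAC, while the −1 slip advances only +2, so the final C of AAC is read again.

```python
# Only the codons named in the caption; this is not a full genetic code table.
CODON_TABLE = {"AAC": "N", "GGG": "G", "CGG": "R"}


def read_two_codons(fragment, slip):
    """Translate the first codon, then the next one with or without a -1 slip.

    Without a slip the ribosome advances +3 nucleotides; with a -1 slip the
    net advance is +2, so the last base of the first codon is read twice.
    """
    first = fragment[0:3]
    advance = 2 if slip else 3
    second = fragment[advance:advance + 3]
    return CODON_TABLE[first] + CODON_TABLE[second]


fragment = "AACGGG"  # AAC (N4401) followed by GGG, as described in the caption
print("canonical:", read_two_codons(fragment, slip=False))
print("frameshifted:", read_two_codons(fragment, slip=True))
```

The canonical path yields N then G (pp1a), and the slipped path yields N then R, matching the N4401/G4402 and N4401/R4402 outcomes described above.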
Figure 6: Variation in regions covered by CRISPR guides and PCR primers.
Browser view of a portion of the viral genome coding for the S protein. CRISPR guide sequences for two Cas13-based detection kits developed by the Broad Institute (Metsky et al., 2020) and NYU are visible. Also visible is the right (3’ end) primer of primer-pair number 73 for whole genome assembly using the nanopore protocol from the ARTIC network Version 3. Although the Variants track reports three mutations in this region, non-reference alleles have been observed only 7 (C>T), 2 (G>T), and 2 (A>T) times in 4353 sequences (observed when mousing over, data not shown), suggesting that these regions are reasonable targets for primers and guides. The 7 instances of C>T are not alarming: an excess of C>T mutations from sequencing is observed throughout the viral genome, and is likely due to spontaneous deamination of cytosine into uracil or APOBEC RNA editing (Simmonds, 2020). (Live Interactive Session: http://genome.ucsc.edu/s/SARS_CoV2/Figure6)
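The allele counts quoted above can be converted to frequencies to make the "rare variant" argument concrete. The sketch below uses only the counts and the total of 4353 sequences stated in the caption; the variable names are hypothetical, and no threshold for an acceptable primer or guide target is implied by the source.

```python
TOTAL_SEQUENCES = 4353  # number of sequences stated in the caption

# Non-reference allele counts stated in the caption.
allele_counts = {"C>T": 7, "G>T": 2, "A>T": 2}

# Fraction of sequences carrying each non-reference allele.
allele_freqs = {
    change: count / TOTAL_SEQUENCES for change, count in allele_counts.items()
}

for change, freq in allele_freqs.items():
    print(f"{change}: {allele_counts[change]}/{TOTAL_SEQUENCES} = {freq:.4%}")
```

Each allele is present in well under 1% of the sampled sequences, which is the basis for treating these regions as reasonable targets.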
Figure 7: Combining Data Tracks to Generate Hypotheses.
A) Browser view of a region of the viral genome that codes for part of the S (Spike) protein. The variants track shows an A>G mutation that causes the amino acid change D614G that is now found more commonly than the reference nucleotide from the original Wuhan outbreak (Andersen et al., 2020; Korber et al., 2020). Additional tracks display peptides within the virus that are predicted to be immunogenic. It is clear that the D614G mutation is contained within a predicted immunogenic peptide. Also shown is an annotated glycosylation site at amino acid 616 (highlighted in aqua) which can affect epitope recognition. B) Structure of the Spike (PDB ID: 6VSB) trimer. Highlighted in blue is the amino acid sequence viewed in A). Inset shows a close up view of the blue region with amino acid side chains. Highlighted in red is (Top) D614 (the product of the allele present in the original reference genome) and (Bottom) G614 (red) substituted in the structure using UCSF Chimera (Pettersen et al., 2004). C) Structure of the immunogenic peptide YQDVNCTEV in complex with HLA-A*02:01. D614 (red) is nestled within the binding groove, leading to a hypothesis that the G614 mutation may alter binding. Note that although browser-based comparisons of this data lend insight into possible models for the increased frequency of G614, further evolutionary and experimental analyses are required to make definitive statements about the functional consequences of this mutation. (Live Interactive Session: http://genome.ucsc.edu/s/SARS_CoV2/Figure7)
Figure 8: Comparative Genomic Analyses of SARS-like Coronaviruses.
A) A view of the “44 Bat CoVs” track in the ACE2 binding region of the Spike protein. Red indicates a nonsynonymous change, green a synonymous change, and pale yellow (occasionally with blue border) indicates regions where no alignable sequence exists. The high divergence (red and pale yellow) in the amino acids within the ACE2 binding site (upper track) relative to those outside of the binding site is an indication of positive selection within the binding site. The “Bat PhastCons” track immediately above the multiple alignment summarizes per-base evolutionary rates for the nucleotide positions in the virus; light gray regions are more rapidly evolving (less conserved) than the black (very conserved) regions. This region of S is expected to experience selection as the ACE2 protein itself rapidly evolves between species in a “genetic arms race” with viruses that use this site. (Live Interactive Session: http://genome.ucsc.edu/s/SARS_CoV2/Figure8a) B) Human Genome Browser view of residues known to contact (in human) coronavirus spike proteins (pink) aligned to a variety of other species. These residues are more rapidly evolving (less conserved) in vertebrates (gray bars) than those that do not contact the Spike (solid black bars indicating strong vertebrate conservation). (Live Interactive Session: http://genome.ucsc.edu/s/SARS_CoV2/Figure8b)
Figure 9: Phylogenetic Analyses of Clade-Specific Variation in SARS-CoV-2.
SARS-CoV-2 mature protein products are shown at the top of the display for context. Below that are three tracks containing Nextstrain phylogenetic trees and clades for selected SARS-CoV-2 genomes sequenced from samples deposited in GISAID. Each row represents a single viral genome; black bars represent the presence of a mutation in the genome submitted to GISAID compared to the reference at that genomic position. As expected, many mutational patterns cluster with branches of the phylogenetic tree. The first track contains data from all 4147 sequences available from Nextstrain as of April 30, 2020. Clades identified by Nextstrain are colored by the same scheme used on the Nextstrain site. The middle track shows only samples from the Nextstrain B clade (warm colors) with increased vertical resolution, and the bottom track shows only samples from the Nextstrain B2 subclade of B (red-orange color) with even greater vertical resolution. At this resolution the sample identifier and additional sample information become visible, including time and location of the sample collection. (Live Interactive Session: http://genome.ucsc.edu/s/SARS_CoV2/Figure9)

References

    1. Abbott TR, Dhamdhere G, Liu Y, Lin X, & Goudy LE (2020). Development of CRISPR as a prophylactic strategy to combat novel coronavirus and influenza. bioRxiv. https://www.biorxiv.org/content/10.1101/2020.03.13.991307v1.abstract
    2. Andersen KG, Rambaut A, Lipkin WI, Holmes EC, & Garry RF (2020). The proximal origin of SARS-CoV-2. Nature Medicine, 26(4), 450–452. 10.1038/s41591-020-0820-9
    3. artic-ncov. (2019). Github. https://github.com/artic-network/artic-ncov2019
    4. Barretto N, Jukneliene D, Ratia K, Chen Z, Mesecar AD, & Baker SC (2005). The papain-like protease of severe acute respiratory syndrome coronavirus has deubiquitinating activity. Journal of Virology, 79(24), 15189–15198. 10.1128/JVI.79.24.15189-15198.2005
    5. Bekaert M, & Rousset J-P (2005). An extended signal involved in eukaryotic −1 frameshifting operates through modification of the E site tRNA. Molecular Cell, 17(1), 61–68. 10.1016/j.molcel.2004.12.009
