This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2020 Oct 1:rs.3.rs-80345.

doi: 10.21203/rs.3.rs-80345/v1.

SARS-CoV-2 gene content and COVID-19 mutation impact by comparing 44 Sarbecovirus genomes

Irwin Jungreis¹,², Rachel Sealfon³, Manolis Kellis¹,²

Affiliations

¹ MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA.
² Broad Institute of MIT and Harvard, Cambridge, MA.
³ Center for Computational Biology, Flatiron Institute, New York, NY.

PMID: 33024961
PMCID: PMC7536840
DOI: 10.21203/rs.3.rs-80345/v1

Irwin Jungreis et al. Res Sq. 2020.

Update in

SARS-CoV-2 gene content and COVID-19 mutation impact by comparing 44 Sarbecovirus genomes.
Jungreis I, Sealfon R, Kellis M. Nat Commun. 2021 May 11;12(1):2642. doi: 10.1038/s41467-021-22905-7. PMID: 33976134. Free PMC article.

Abstract

Despite its overwhelming clinical importance, the SARS-CoV-2 gene set remains unresolved, hindering dissection of COVID-19 biology. Here, we use comparative genomics to provide a high-confidence protein-coding gene set, characterize protein-level and nucleotide-level evolutionary constraint, and prioritize functional mutations from the ongoing COVID-19 pandemic. We select 44 complete Sarbecovirus genomes at evolutionary distances ideally-suited for protein-coding and non-coding element identification, create whole-genome alignments, and quantify protein-coding evolutionary signatures and overlapping constraint. We find strong protein-coding signatures for all named genes and for 3a, 6, 7a, 7b, 8, 9b, and also ORF3c, a novel alternate-frame gene. By contrast, ORF10, and overlapping-ORFs 9c, 3b, and 3d lack protein-coding signatures or convincing experimental evidence and are not protein-coding. Furthermore, we show no other protein-coding genes remain to be discovered. Cross-strain and within-strain evolutionary pressures largely agree at the gene, amino-acid, and nucleotide levels, with some notable exceptions, including fewer-than-expected mutations in nsp3 and Spike subunit S1, and more-than-expected mutations in Nucleocapsid. The latter also shows a cluster of amino-acid-changing variants in otherwise-conserved residues in a predicted B-cell epitope, which may indicate positive selection for immune avoidance. Several Spike-protein mutations, including D614G, which has been associated with increased transmission, disrupt otherwise-perfectly-conserved amino acids, and could be novel adaptations to human hosts. The resulting high-confidence gene set and evolutionary-history annotations provide valuable resources and insights on COVID-19 biology, mutations, and evolution.

Conflict of interest statement

Competing interest declaration

The authors declare no competing interests.

Figures

**Extended Data Figure 1. PhyloCSF signal for polyprotein ORF1.**
UCSC Genome Browser view of the SARS-CoV-2 genome for polyprotein ORF1ab, showing UniProt gene annotations for individual non-structural proteins (nsp), PhyloCSF tracks (green) in each of 3 reading frames, and Synonymous Constraint Elements (SCEs, red), along with phastCons/phyloP nucleotide-level constraint (green/blue). Polyprotein 1ab is processed into 16 mature peptides, nsp1-nsp16. PhyloCSF shows a clear protein-coding signal for all proteins, indicating that all are functional proteins (except nsp11, red circle, discussed in the main text). The PhyloCSF signal captures the correct frame throughout the entire length of each protein (except nsp3, where some small regions show reduced frame-2 signal and/or increased frame-3 signal, but upon inspection these are only stop-codon-free in frame-2 and do not represent dual-coding candidates).

**Extended Data Figure 2. Phylogenetic tree of 44 Sarbecovirus genomes and larger phylogenetic context.**
Left: Phylogenetic tree of a selection of Coronaviridae genomes, including the seven that infect humans (red asterisks). Right: Phylogenetic tree of the 44 Sarbecovirus genomes used in this study. Trees are based on whole-genome alignments and might be different from the history at particular loci, due to recombination.

**Extended Data Figure 3. ORF10 is not protein-coding.**
a. Alignment of Sarbecovirus genomes at ORF10, including 30 additional flanking nucleotides on each side. Most substitutions are amino-acid-changing, either radical (red) or conservative (dark green), with only two synonymously-changing positions (light green), indicating this is not a protein-coding region. In addition, nearly all strains show an earlier stop codon (cyan), further reducing the length of this already-short ORF from 38 amino acids to 25, and one of the four strains lacking the earlier stop includes a frame-shifting deletion. The putative partial transcription-regulatory sequence (TRS) present in SARS-CoV-2 and its closest relative (Bat CoV RaTG13) is not conserved in any other strains. The region surrounding ORF10 shows very high nucleotide-level conservation, which spans ORF10 and extends beyond its boundaries in both directions, indicating that this portion of the genome is functionally important even though it does not code for protein (indeed, this region is part of a pseudoknot RNA structure involved in RNA synthesis). b. Ribosome footprints previously used to suggest ORF10 translation in fact localize either in an upstream ORF (uORF, green) or in an internal ORF (green, “final predictions” track), but not in the unique portion of ORF10 (dashed black box), indicating they are less likely to reflect functional translation of ORF10, and more likely to represent incidental translation initiation events. We note that the density of elongating footprints in the unique portion (black box) is no greater than the density after the stop codon (red box), consistent with incidental events. We also note that the internal ORF is only 18 codons long in 4 strains, and only 5 codons long in the other 40 Sarbecovirus strains given the early stop codon (purple box), and is therefore unlikely to be functional. Footprint tracks show elongating ribosome footprints in cells treated with cycloheximide (blue, CHX), and footprints enriched for initiating ribosomes using harringtonine (Harr, red) and lactimidomycin (LTM, green). “mRNA-seq” track shows RNA-seq reads. c. CodAlignView of the alignment previously used to argue that a high dN/dS ratio in ORF10 indicated positive selection for protein-coding-like rapid evolution, based on only six closely-related strains (SARS-CoV-2, three bat viruses, two pangolin viruses). The authors noted a frameshifting deletion (orange/grey) in one of the bat viruses, which provides strong evidence against conserved protein-coding function, but they interpreted it (without evidence) as a potential sequencing error and excluded the strain from consideration. Even ignoring the frameshift-containing strain, the evidence used is insufficient to reach statistical significance: the alignment includes only 9 substitutions, of which 4 are radical, 4 are conservative, and 1 is synonymous. In a neutrally-evolving region with 9 substitutions, we would expect 2–3 synonymous changes, depending on the evolutionary model used, and a depletion to only 1 synonymous change is not statistically significant (nominal p-value > 0.18 even in the most generous evolutionary model). This already-non-significant nominal p-value would move even further from significance with the necessary multiple-hypothesis corrections.
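
The significance argument in panel c can be illustrated with a one-sided binomial tail. This is a minimal sketch, not the authors' calculation: it assumes each substitution is independently synonymous with a fixed probability, and the value 2/9 (about 2 expected synonymous changes among 9) is an illustrative choice from the 2–3 range quoted above. The function name `depletion_p_value` is hypothetical.

```python
from math import comb


def depletion_p_value(n_subs: int, n_syn: int, p_syn: float) -> float:
    """One-sided binomial tail P(X <= n_syn) for X ~ Binomial(n_subs, p_syn).

    p_syn is the assumed probability that a neutral substitution is
    synonymous (an illustrative parameter, not a value from the paper).
    """
    if not 0.0 <= p_syn <= 1.0:
        raise ValueError("p_syn must be a probability")
    return sum(
        comb(n_subs, k) * p_syn**k * (1 - p_syn) ** (n_subs - k)
        for k in range(n_syn + 1)
    )


# 9 substitutions, 1 observed synonymous, about 2 expected under neutrality.
p = depletion_p_value(9, 1, 2 / 9)
print(f"nominal one-sided p-value: {p:.3f}")
```

Under this simplified model, observing a single synonymous change among 9 substitutions is unremarkable. The p > 0.18 bound in the caption comes from the authors' own evolutionary models, which are not reproduced here.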

**Extended Data Figure 4. Nucleocapsid-overlapping ORF9c is not protein-coding.**
Sarbecovirus alignment of frame3-encoded ORF9c (top), which overlaps frame2-encoded Nucleocapsid (bottom). The ORF9c start codon is lost in one strain, and most strains have an earlier UAG stop codon (magenta) 3 codons before the end. In Nucleocapsid-encoding frame 2 (bottom), nearly all nucleotide substitutions are amino-acid-preserving (synonymous, light green), indicating strong purifying selection for protein-coding function. By contrast, in ORF9c-encoding frame 3 (top), nearly all nucleotide substitutions result in function-disrupting (radical) amino acid changes (red), and very few result in synonymous (light green) or function-preserving (conservative, dark green) substitutions, indicating a lack of purifying selection for protein-coding function in ORF9c, so it is unlikely to have a conserved protein-coding function. In addition, ORF9c is unlikely to be translated via leaky ribosomal scanning because its start codon is 460 nucleotides after N’s (red arrow) with 9 intervening AUG codons (green dots), direct-RNA sequencing found no ORF9c-specific subgenomic RNAs, no TRS is appropriately positioned to create one, and several SARS-CoV-2 isolates contain stop-introducing mutations, indicating that ORF9c is not a recently-evolved strain-specific gene either. We conclude 9c is not protein-coding.
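
The contrast between frames can be made concrete by translating the same nucleotide substitution in two overlapping reading frames. The sketch below uses the standard genetic code; the short sequence and the names `CODON_TABLE` and `classify_substitution` are hypothetical and chosen only for illustration. It separates synonymous from amino-acid-changing substitutions and does not attempt the radical/conservative split, which depends on a scoring scheme not specified here.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(seq: str, pos: int, new_base: str, frame: int) -> str:
    """Classify a single-nucleotide change at 0-based `pos`.

    `frame` (0, 1, or 2) is the offset of the first complete codon.
    """
    codon_start = pos - ((pos - frame) % 3)
    if codon_start < frame or codon_start + 3 > len(seq):
        raise ValueError("position is not inside a complete codon in this frame")
    old_codon = seq[codon_start:codon_start + 3]
    offset = pos - codon_start
    new_codon = old_codon[:offset] + new_base + old_codon[offset + 1:]
    if CODON_TABLE[old_codon] == CODON_TABLE[new_codon]:
        return "synonymous"
    return "amino-acid-changing"


# Made-up 9-nt sequence: the same T-to-C change at index 2 is silent in
# frame 0 (GCT -> GCC, Ala -> Ala) but changes frame 1 (CTG -> CCG, Leu -> Pro).
toy = "GCTGAAGGT"
in_frame0 = classify_substitution(toy, 2, "C", frame=0)
in_frame1 = classify_substitution(toy, 2, "C", frame=1)
```

In an overlapping region, a frame under purifying selection is expected to accumulate mostly changes of the first kind, which is the pattern the caption describes for Nucleocapsid but not for ORF9c.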

**Extended Data Figure 5. Nucleocapsid-overlapping ORF9b is protein-coding.**
Sarbecovirus alignment of frame3-encoded ORF9b (top), which overlaps frame2-encoded Nucleocapsid (bottom). Although ORF9b-encoding frame 3 shows many function-disrupting (radical, red) substitutions, its start codon (red box) is perfectly conserved, its stop codon (blue box) is perfectly conserved, and there are no intermediate stop codons in any strain. Moreover, its Kozak start-codon context (dashed black box) is optimal for ribosomal start codon recognition, with A in position -3 and G in position +4 (green boxes), while the start codon context of N is less optimal, with an A in -3 and T in +4 (orange boxes), making it likely that ORF9b can be translated by leaky scanning from the same subgenomic RNA as N, as it is only ~2 codons downstream of N’s start. Moreover, both the optimal 9b start-codon context and the less-optimal N start-codon context are fully conserved across all Sarbecovirus strains, indicating that leaky-scanning translation may be a conserved feature throughout Sarbecoviruses. In addition, ORF9b shows significant localized synonymous constraint in N in its start and end regions (Fig. 3), even relative to the overall low synonymous rate of N, consistent with dual-coding functions. ORF9b also has proteomics support in SARS-CoV-2, including evidence of viral-RNA binding, and alternate-frame translation support by ribosome profiling. In SARS-CoV, ORF9b protein (and antibodies to it) was detected in SARS patients, localized in mitochondria, and interfered with host cell antiviral response when overexpressed. We conclude ORF9b encodes a conserved functional protein.
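
The start-codon context rule used above can be expressed as a small check. This is a minimal sketch assuming the common convention that the A of AUG is position +1, so position -3 is three bases upstream and +4 is the base immediately after the AUG. Treating any purine at -3 as favorable, the three-level labels, and the function name `kozak_strength` are illustrative simplifications, and the test sequences are made up rather than taken from the genome.

```python
def kozak_strength(seq: str, start: int) -> str:
    """Label the context of the ATG at 0-based index `start` in a DNA string."""
    if seq[start:start + 3] != "ATG":
        raise ValueError("no ATG at the given index")
    if start < 3 or start + 3 >= len(seq):
        raise ValueError("sequence too short to read positions -3 and +4")
    purine_at_minus3 = seq[start - 3] in "AG"  # position -3
    g_at_plus4 = seq[start + 3] == "G"         # position +4
    if purine_at_minus3 and g_at_plus4:
        return "strong"
    if purine_at_minus3 or g_at_plus4:
        return "adequate"
    return "weak"


# A at -3 and G at +4 (the pattern described for ORF9b).
ctx_9b_like = kozak_strength("CCACCATGGC", 5)
# A at -3 but T at +4 (the pattern described for N).
ctx_n_like = kozak_strength("CCACCATGTC", 5)
# Neither favorable base present.
ctx_weak = kozak_strength("CCTCCATGTC", 5)
```

A scanning ribosome is more likely to skip a start codon in a weaker context, which is why a less-optimal upstream start followed closely by a strong one is compatible with leaky scanning.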

**Extended Data Figure 6. ORF3b is not protein-coding.**
Sarbecovirus alignment of the SARS-CoV 154-codon ORF3b overlapping ORF3a (reordered with SARS-CoV and related strains on top). Although the start codon is conserved in all but one strain, ORF length is highly variable due to numerous in-frame stop codons (red ovals and red rectangle). The 22-codon ORF in SARS-CoV-2 has a strongly negative PhyloCSF score, does not overlap any SCEs, and even among the four strains sharing its stop codon (blue rectangle) all six substitutions are radical amino acid changes, providing no evidence of amino-acid-level purifying selection. Ribosome profiling did not find translation of ORF3b, transcription studies did not find substantial transcription of an ORF3b-specific subgenomic RNA, and translation by leaky scanning would implausibly require ribosomal bypass of eight AUG codons (green rectangles, top panel), some with strong Kozak context. (Supplementary Fig. S3 shows a comparison to the reading frame of ORF3a.)

**Extended Data Figure 7. ORF3d is not protein-coding.**
Sarbecovirus alignment of 57-codon ORF3d (referred to by some authors as 3b) overlapping ORF3a shows mostly function-altering radical amino-acid substitutions (red columns), and repeated interruption by one or more premature stop codons in all other strains (red ovals), unambiguously indicating that ORF3d is not a conserved protein-coding gene. A substantial fraction of SARS-CoV-2 isolates have stop-introducing mutations, and ribosome profiling did not identify ORF3d as a translated ORF, indicating that it is not a recently-evolved strain-specific gene either.

**Extended Data Figure 8. Branch-length-adjusted PhyloCSF score strongly rejects ORF10.**
Similar to Fig. 1c, but showing PhyloCSF scores per codon divided by the average number of substitutions per site, to adjust for the fact that high-nucleotide-conservation regions show compressed unscaled PhyloCSF scores (closer to zero) because there are fewer nucleotide substitution events. The branch-length-scaled score distribution further separates the scores of confirmed protein-coding genes (green) from non-protein-coding segments (red). The very low score of ORF10 with this adjustment indicates that its only-slightly-negative unscaled PhyloCSF score in Fig. 1c stems from the high nucleotide conservation of the region, rather than protein-coding constraint. The scores of N-overlapping ORFs 9b and 9c are both reduced, consistent with the high nucleotide conservation of N. Notably, the branch-length-adjusted score for 3c remains high, consistent with its protein-coding nature, despite the higher overall nucleotide conservation of its dual-coding region. We have manually inspected all other candidates with adjusted scores higher than 9c, and none is protein-coding: two are discussed in Supplementary Figure S4, and all of the remaining candidates contain internal stop codons.
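
The adjustment described above is a two-step normalization, sketched below. The function name `adjusted_score` and all numbers are hypothetical; only the formula (score per codon divided by average substitutions per site) comes from the caption.

```python
def adjusted_score(phylocsf_score: float, n_codons: int, subs_per_site: float) -> float:
    """PhyloCSF score per codon, divided by average substitutions per site."""
    if n_codons <= 0:
        raise ValueError("n_codons must be positive")
    if subs_per_site <= 0:
        raise ValueError("subs_per_site must be positive")
    per_codon = phylocsf_score / n_codons
    return per_codon / subs_per_site


# Two made-up 25-codon ORFs with the same raw score of -50 (-2 per codon).
# One lies in a highly conserved region (0.1 substitutions per site),
# the other in a fast-evolving region (1.0 substitutions per site).
conserved_region = adjusted_score(-50.0, 25, 0.1)
divergent_region = adjusted_score(-50.0, 25, 1.0)
```

With these toy inputs the two ORFs are indistinguishable before scaling, but the one in the conserved region drops by a factor of ten after scaling. This mirrors the caption's point that a modestly negative raw score in a highly conserved region can hide a strongly negative adjusted score.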

**Extended Data Figure 9. Single nucleotide variants and conservation.**
Error bars indicate standard error of mean. a. Density of SNVs disrupting conserved amino acids (dark red) is significantly lower than the density of SNVs disrupting non-conserved amino acids (light red). Both densities are higher near the 3’ end of the genome, indicating higher mutation rate or less purifying selection even among amino acids that are perfectly conserved in Sarbecovirus. b. Density of synonymous variants in synonymously constrained codons (dark green) is significantly lower than among synonymously unconstrained codons (light green), a depletion seen in most genes. Overall, conservation in the Sarbecovirus clade at both the amino acid level and nucleotide level is associated with purifying selection on variants in the SARS-CoV-2 population. c. Alignment of the 20-amino-acid Nucleocapsid region that is highly enriched for variants disrupting perfectly conserved amino acids (alternate alleles shown in second row, W = A or T, K = G or T). There are 14 non-synonymous variants among the 14 perfectly conserved amino acids (columns with no red or dark green). This region is contained within a predicted B-cell epitope, suggesting positive selection for immune system avoidance.

**Figure 1. Overview.**
a. Previously annotated named (black font) and unnamed or proposed (blue font) SARS-CoV-2 genes, with confirmed protein-coding (green), rejected (red), or novel protein-coding (purple) classification, using evolutionary and experimental evidence. b. Phylogenetic Codon Substitution Frequencies (PhyloCSF) scores distinguish protein-coding (left) vs. non-coding (right) using evolutionary signatures, including distinct frequencies of amino-acid-preserving (green) vs. amino-acid-disruptive (red) substitutions, and stop codons (cyan/magenta/yellow) in frame-specific alignments, and additional features. c. PhyloCSF score (x-axis) for all confirmed (green) and rejected (red) ORFs, showing annotated/hypothetical/novel (labeled) and all AUG-initiated ≥25-codons-long locally-maximal ORFs (unlabelled). Novel ORF3c (purple) clusters with protein-coding. Only-modestly-negative ORF9c/ORF10 scores are artifacts of score compression in high-nucleotide-constraint regions, and substantially drop when nucleotide-conservation-scaled (see Extended Data Fig. 8).
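
The candidate set in panel c (AUG-initiated ORFs of at least 25 codons) can be enumerated with a simple three-frame scan. This is a minimal sketch under stated assumptions: it scans the forward strand only, counts codons from the AUG up to but not including the stop, and reads "locally-maximal" as keeping, for each stop codon, only the most upstream in-frame AUG. That reading and the function name `find_orfs` are assumptions, and the test sequence is synthetic.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 25):
    """Return (frame, start, end) tuples for forward-strand ATG-initiated ORFs.

    start is the 0-based index of the ATG; end is the index just past the
    stop codon. For each stop, only the most upstream in-frame ATG is kept.
    """
    orfs = []
    for frame in range(3):
        first_atg = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and first_atg is None:
                first_atg = i
            elif codon in STOP_CODONS:
                # Length excludes the stop codon itself.
                if first_atg is not None and (i - first_atg) // 3 >= min_codons:
                    orfs.append((frame, first_atg, i + 3))
                first_atg = None
    return orfs


# Synthetic sequence: a 31-codon ORF followed by a 2-codon ORF, both in frame 2.
toy_genome = "CC" + "ATG" + "GCT" * 30 + "TAA" + "ATGGCTTAA"
long_orfs = find_orfs(toy_genome)                 # default threshold of 25 codons
all_orfs = find_orfs(toy_genome, min_codons=1)    # threshold lowered for comparison
```

Each candidate found this way would then be scored for protein-coding signatures; the scoring itself is outside the scope of this sketch.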

**Figure 2. Genome-wide protein-coding signatures.**
SARS-CoV-2 NCBI/UniProt genes (blue), unannotated proposed genes and mapped SARS-CoV genes (black, panel b only), frame-specific protein-coding PhyloCSF scores (green), Synonymous Constraint Elements (SCEs) (red), and phastCons/phyloP nucleotide-level constraint (green/blue/red) across genomic coordinates (x-axis) for entire genome (panel a) and final 4-kb subset (panel b, dashed black box), highlighting (light blue boxes): **(a)** strong protein-coding signal in correct frame for each named gene; conservation-signal frame-change at programmed frameshift site; strong protein-coding signal throughout S despite lack of nucleotide conservation in S1; **(b)** unambiguous and frame-specific protein-coding signal for unnamed ORFs 3a (despite only partial nucleotide conservation), 7a, 7b, and 8 (despite lack of nucleotide conservation); clear protein-coding signal in first half and last quarter of ORF6; no protein-coding signal for 10 (despite high nucleotide conservation); synonymous constraint (red) in novel-ORF 3c and confirmed-ORF 9b; no synonymous constraint in rejected ORFs 9c, 3b, 3d.

**Figure 3. Synonymous constraint in Nucleocapsid overlaps 9b but not 9c/14.**
Synonymous substitution rate in 9-codon windows (y-axis) across N (x-axis), normalized to gene-wide average (dotted black line). Synonymous constraint elements (blue) expected for dual-coding constraint localize in overlapping ORF9b (dashed orange rectangle), indicating it is protein-coding, but not in 9c (dashed purple rectangle), indicating it is not protein-coding. PhyloCSF protein-coding signal (green) in frame 3 (encoding 9b and 9c/14) remains strongly negative throughout the length of 9c/14 (green box), indicating 9c/14 is non-coding, but rises to near-zero values for two regions of 9b, indicating protein-coding selection, while the PhyloCSF signal in frame 2 (encoding N) remains consistently high throughout the length of ORF9c.
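
The windowed, gene-normalized rate plotted on the y-axis can be approximated with a moving average. This is only a sketch of the normalization idea: the synonymous constraint elements in the figure come from a statistical model that is not reproduced here, and the per-codon counts, the function name `windowed_relative_rate`, and the toy data are hypothetical.

```python
def windowed_relative_rate(per_codon_syn, window: int = 9):
    """Mean synonymous substitutions per codon in each sliding window,
    divided by the gene-wide mean (so 1.0 means 'average for this gene')."""
    n = len(per_codon_syn)
    if window < 1 or window > n:
        raise ValueError("window must be between 1 and the number of codons")
    gene_mean = sum(per_codon_syn) / n
    if gene_mean == 0:
        raise ValueError("gene has no synonymous substitutions to normalize by")
    rates = []
    for i in range(n - window + 1):
        window_mean = sum(per_codon_syn[i:i + window]) / window
        rates.append(window_mean / gene_mean)
    return rates


# Toy gene of 18 codons: the first half evolves freely at synonymous sites,
# the second half has none (as might be seen in a dual-coding region).
toy_counts = [2] * 9 + [0] * 9
relative = windowed_relative_rate(toy_counts)
```

Windows with values well below 1.0 are the kind of region a formal test would evaluate as candidate synonymous constraint.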

**Figure 4. Novel gene 3c overlapping 3a is protein-coding.**
a. Synonymous-constraint elements (blue) match the boundaries of the 41-codon ORF3c dual-coding region (black) nearly perfectly, and protein-coding evolutionary signatures (green) switch between frame 1 and 2 (rows) in the dual-coding region, with frame-2 signal (negative flanking ORF3c) increasing to near-zero, and frame-1 signal (high flanking the dual-coding region) dropping to near-zero. **b,c**. Codon-resolution evolutionary signatures (colors, CodAlignView) annotating genomic alignment (letters) spanning ORF3a start and dual-coding region, in frame-1 (top) and frame-2 (bottom), highlighting (blue boxes): (b, frame-2, ORF3c) radical codon substitutions (red) and stop codons (yellow, magenta, cyan) prior to ORF3c start; synonymous (light green) and conservative (dark green) substitutions in ORF3c; ORF3c’s start codon is conserved, except in one strain (row 4) with near-cognate GUG; ORF3c’s stop codon is conserved except for one-codon extension in two strains (rows 2–3); no intermediate stop codons in ORF3c; (c, frame-1, ORF3a) abundant synonymous and conservative substitutions in 3a prior to dual-coding region; increase in fully-conserved codons (white) over dual-coding region. Short interval (61nt) with only one weak-Kozak-context intervening start codon indicates ORF3c may be translated from ORF3a’s subgenomic RNA via leaky scanning.

**Figure 5. Within-strain variation vs. inter-strain divergence.**
**a. Gene-level comparison**. Long-term inter-strain evolutionary divergence (x-axis) and short-term within-strain variation (y-axis) show strong agreement (linear regression dotted line, Spearman-correlation=0.70) across mature proteins (crosses, denoting standard error of mean on each axis), indicating that Sarbecovirus-clade selective pressures persist in the current pandemic. Well-characterized genes (black) show fewer changes in both timescales (bottom left) and less-well-characterized ORFs (blue) show more in both (top right). Significantly-deviating exceptions are: nsp3 and S1 (bottom right) showing significantly-fewer amino-acid-changing SNVs than expected from their cross-Sarbecovirus rapid evolution, and N (top left), showing significantly-more, possibly due to accelerated evolution in the current pandemic. **b. Rapidly-evolving Nucleocapsid region**. Top: Nucleocapsid context showing B-cell epitope predictions (black, “IEDB Predictions” track), and our annotation track-hub showing: conserved amino acids (red blocks), synonymously-constrained codons (green blocks), and SNV classification (colored tick-marks) as conserved/non-conserved (dark/light) and missense/synonymous (red/green); top 3 tracks show AUG codons (green) and stop codons (red) in three frames. Bottom: Focus on 20-amino-acid region R185-G204 (dotted box) in predicted B-cell epitope (black) significantly-enriched for amino-acid-changing variants (red) disrupting perfectly-conserved residues, indicative of positive selection in SARS-CoV-2 for immune system avoidance. **c. Spike D614G evolutionary context**. Sarbecovirus alignment (text) surrounding Spike D614G amino-acid-changing SNV, which rose in frequency in multiple geographic locations, suggesting increased transmissibility. This A-to-G SNV disrupts a perfectly-conserved nucleotide (bold font, A-to-G), which disrupts a perfectly-conserved amino-acid (red box, D-to-G), in a perfectly-conserved 11-amino-acid region (dotted black box, light-green=synonymous-substitutions) across bat-host Sarbecoviruses, suggesting that D614G may represent a human-host-adaptive mutation.
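
The agreement statistic in panel a is a Spearman rank correlation, which can be computed as the Pearson correlation of ranks. The sketch below is a plain-Python illustration with made-up values; it does not use the paper's per-gene data, and the helper names `average_ranks` and `spearman` are hypothetical.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Made-up divergence and variation values for five hypothetical genes.
divergence = [1, 2, 3, 4, 5]
variation = [2, 1, 4, 3, 5]
rho = spearman(divergence, variation)
```

Because it depends only on rank order, this statistic is insensitive to the very different scales of long-term divergence and short-term variation.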
