Comparative Study

Nat Commun. 2021 May 11;12(1):2642.

doi: 10.1038/s41467-021-22905-7.

SARS-CoV-2 gene content and COVID-19 mutation impact by comparing 44 Sarbecovirus genomes

Irwin Jungreis¹,², Rachel Sealfon³, Manolis Kellis⁴,⁵

Affiliations

¹ MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA. iljungr@csail.mit.edu.
² Broad Institute of MIT and Harvard, Cambridge, MA, USA. iljungr@csail.mit.edu.
³ Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA.
⁴ MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA. manoli@mit.edu.
⁵ Broad Institute of MIT and Harvard, Cambridge, MA, USA. manoli@mit.edu.

PMID: 33976134
PMCID: PMC8113528
DOI: 10.1038/s41467-021-22905-7

Irwin Jungreis et al. Nat Commun. 2021.


Abstract

Despite its clinical importance, the SARS-CoV-2 gene set remains unresolved, hindering dissection of COVID-19 biology. We use comparative genomics to provide a high-confidence protein-coding gene set, characterize evolutionary constraint, and prioritize functional mutations. We select 44 Sarbecovirus genomes at ideally-suited evolutionary distances, and quantify protein-coding evolutionary signatures and overlapping constraint. We find strong protein-coding signatures for ORFs 3a, 6, 7a, 7b, 8, 9b, and a novel alternate-frame gene, ORF3c, whereas ORFs 2b, 3d/3d-2, 3b, 9c, and 10 lack protein-coding signatures or convincing experimental evidence of protein-coding function. Furthermore, we show no other conserved protein-coding genes remain to be discovered. Mutation analysis suggests ORF8 contributes to within-individual fitness but not person-to-person transmission. Cross-strain and within-strain evolutionary pressures agree, except for fewer-than-expected within-strain mutations in nsp3 and S1, and more-than-expected in nucleocapsid, which shows a cluster of mutations in a predicted B-cell epitope, suggesting immune-avoidance selection. Evolutionary histories of residues disrupted by spike-protein substitutions D614G, N501Y, E484K, and K417N/T provide clues about their biology, and we catalog likely-functional co-inherited mutations. Previously reported RNA-modification sites show no enrichment for conservation. Here we report a high-confidence gene set and evolutionary-history annotations providing valuable resources and insights on SARS-CoV-2 biology, mutations, and evolution.


Conflict of interest statement

The authors declare no competing interests.

Figures

**Fig. 1. Overview.**
a Coronavirus-wide (black font) and species-specific or candidate (blue font) SARS-CoV-2 genes, with confirmed protein-coding (green), rejected (red), or novel protein-coding (purple) classification, using evolutionary and experimental evidence. b Phylogenetic Codon Substitution Frequencies (PhyloCSF) scores distinguish protein-coding (left) vs. non-coding (right) using evolutionary signatures, including distinct frequencies of amino-acid-preserving (green) vs. amino-acid-disruptive (red) substitutions, and stop codons (cyan/magenta/yellow) in frame-specific alignments, and additional features. c PhyloCSF score (x-axis) for all confirmed (green) and rejected (red) ORFs, showing annotated/candidate/novel (labeled) and all AUG-initiated ≥25-codons-long locally maximal ORFs (unlabeled). Novel ORF3c (purple) clusters with protein-coding. Only modestly negative ORF9c/ORF10 scores are artifacts of score compression in high-nucleotide-constraint regions, and substantially drop when nucleotide-conservation-scaled (see Supplementary Fig. 3).
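The candidate set described in panel c (AUG-initiated ORFs of at least 25 codons) can be illustrated with a minimal sketch. It makes two assumptions that the caption does not confirm: that "locally maximal" means keeping only the most upstream AUG for each stop codon, and that the 25-codon threshold excludes the stop codon. The function name `find_orfs` and the toy sequence are hypothetical. Only the forward strand is scanned, since the genome is positive-sense RNA.

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=25):
    """Return (start, end, frame) tuples for ATG-initiated forward-strand ORFs.

    start/end are 0-based, end-exclusive, and the span includes the stop codon.
    Only the most upstream ATG per stop codon is reported (assumed reading of
    "locally maximal"); min_codons counts sense codons only (assumption).
    """
    seq = seq.upper().replace("U", "T")
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG since the last stop codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i
            elif codon in STOPS:
                if start is not None and (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs

# Toy sequence (not genomic data): one 31-codon ORF followed by a 6-codon ORF.
toy = "CC" + "ATG" + "GCT" * 30 + "TAA" + "ATG" + "GCT" * 5 + "TAG"
print(find_orfs(toy))
```

Each ORF found this way would still need to be scored (for example with PhyloCSF on a multi-genome alignment) before any protein-coding call; the scan only produces candidates.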

**Fig. 2. Genome-wide protein-coding signatures.**
SARS-CoV-2 NCBI/UniProt genes (blue), unannotated candidate genes and mapped SARS-CoV genes (black, panel b only), frame-specific protein-coding PhyloCSF scores (green), Synonymous Constraint Elements (SCEs) (blue), and phastCons/phyloP nucleotide-level constraint (green/blue/red) across genomic coordinates (x-axis) for entire genome (panel a) and final 4-kb subset (panel b, dashed black box): a strong protein-coding signal in correct frame for each named gene; conservation-signal frame-change at programmed frameshift site; strong protein-coding signal throughout S despite lack of nucleotide conservation in S1; b unambiguous and frame-specific protein-coding signal for ORFs 3a (despite only partial nucleotide conservation), 7a, 7b, and 8 (despite lack of nucleotide conservation); clear protein-coding signal in first half and last quarter of ORF6; no protein-coding signal for 10 (despite high nucleotide conservation); synonymous constraint (blue) in novel-ORF 3c and confirmed-ORF 9b; no synonymous constraint in rejected ORFs 9c, 3b, 3d.

**Fig. 3. Phylogenetic tree of 44 *Sarbecovirus* genomes and larger phylogenetic context.**
Left: Phylogenetic tree of a selection of *Orthocoronavirinae* genomes, including the seven that infect humans (red asterisks). Right: Phylogenetic tree of the 44 *Sarbecovirus* genomes used in this study (all belong to the species *Severe acute respiratory syndrome-related coronavirus*). Trees are based on whole-genome alignments and might be different from the history at particular loci, due to recombination.

**Fig. 4. Protein-coding decision flow chart.**
Flow chart indicates main steps in determining if an ORF encodes a functional protein (light green ovals), is not protein-coding (red ovals), or is translated but with ambiguous protein-coding status (yellow oval), with cases for conserved non-overlapping, conserved overlapping, and non-conserved ORFs. Decisions are based on sequence features (blue rectangles), evolutionary signatures across *Sarbecovirus* (orange rectangles), within-strain variants (dark green rectangle), or experimental evidence (purple rectangles). Actual process considers additional details (Supplementary Note 2).

**Fig. 5. ORF10 is not protein-coding.**
a Alignment of *Sarbecovirus* genomes at ORF10, including 30nt on each side. Most substitutions are radical (red) or conservative (dark green) amino-acid-changing, with only two synonymously changing positions (light green), indicating this is not a conserved protein-coding ORF. Nearly all strains show an earlier stop codon (cyan), further reducing the length of this already-short ORF from 38 codons to 25, and another strain includes a frame-shifting deletion (orange). Putative partial transcription-regulatory sequence (TRS) present in SARS-CoV-2 and Bat CoV RaTG13 is not present in other strains. The surrounding region shows high nucleotide-level conservation, spanning ORF10 and extending beyond its boundaries in both directions, indicating this region is functionally important even though it does not encode protein (indeed, it is part of a known RNA structure). b Ribosome footprints previously used to suggest that ORF10 might be translated in fact localize either in an upstream ORF (uORF, green) or in an internal ORF (green, “final predictions” track), but density in the unique portion of ORF10 (dashed black box) is no greater than after the stop codon (red box), indicating they are less likely to reflect the functional translation of ORF10, and more likely to represent incidental translation initiation events. The internal ORF is only 18 codons long in 4 strains, and 5 in the other strains, given the early stop codon (purple box), and unlikely to be functional. Footprint tracks show elongating ribosome footprints in cells treated with cycloheximide (blue, CHX), and footprints enriched for initiating ribosomes using harringtonine (Harr, red), and lactimidomycin (LTM, green). “mRNA-seq” track shows RNA-seq reads. c Alignment of six closely related strains (SARS-CoV-2, three bat viruses, two pangolin viruses) previously used to argue that high dN/dS ratio in ORF10 indicated positive selection for protein-coding-like rapid evolution. 
A frameshifting deletion (orange/gray) in one bat virus militates against conserved protein-coding function. Even ignoring that strain, the evidence is not statistically significant: the alignment includes only 9 substitutions, including 1 synonymous. In a neutrally evolving region with 9 substitutions, we would expect 2–3 synonymous changes, and a depletion to only 1 is not statistically significant even without multiple-hypothesis correction (P > 0.18).
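The significance argument in panel c can be sketched as a one-sided binomial tail. This is an illustration, not necessarily the test the authors used: it assumes each substitution is independently synonymous with probability `p_syn` under neutrality, and the values 0.25 and 0.30 are illustrative choices matching the caption's "2–3 of 9" expectation.

```python
from math import comb

def binom_lower_tail(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# Probability of seeing at most 1 synonymous change among 9 substitutions,
# for two assumed neutral synonymous fractions.
for p_syn in (0.25, 0.30):
    tail = binom_lower_tail(1, 9, p_syn)
    print(f"p_syn={p_syn:.2f}: P(<=1 synonymous of 9) = {tail:.3f}")
```

Under these assumed fractions the tail probability is roughly 0.2 to 0.3, in line with the caption's conclusion that a depletion to a single synonymous change is not statistically significant.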

**Fig. 6. Nucleocapsid-overlapping ORF9b is protein-coding but not ORF9c.**
a Synonymous substitution rate in 9-codon windows (y-axis) across N (x-axis), normalized to gene-wide average (dotted black line). Two small synonymous constraint elements (SCEs, blue) expected for dual-coding regions localize near ends of overlapping 97-codon ORF9b (dashed orange rectangle), but the synonymous rate is high in the central portion. No SCEs localize to 73-codon ORF9c (dashed green rectangle). PhyloCSF protein-coding signal (green) in frame 3 (encoding ORF9b and ORF9c) remains strongly negative throughout ORF9c but rises to near-zero for two regions of ORF9b, while the N-encoding frame-2 signal remains consistently high throughout ORF9c. b *Sarbecovirus* alignment of ORF9c. Start codon is lost in one strain, and most have a UAG stop codon (magenta) 3 codons before the end. Nearly all substitutions are function-disrupting amino acid changes (red), and very few are synonymous (light green) or conservative (dark green), consistent with lack of PhyloCSF signal and synonymous constraint, indicating ORF9c does not play conserved protein-coding functions. Translation via leaky scanning is unlikely because ORF9c’s start is 460 nucleotides after N’s with 9 intervening AUGs (Supplementary Fig. 6), direct-RNA sequencing found no ORF9c-specific subgenomic RNAs, and several SARS-CoV-2 isolates contain stop-introducing mutations, indicating ORF9c is not a recently evolved strain-specific gene either. c *Sarbecovirus* alignment of ORF9b. Although ORF9b shows many function-disrupting substitutions, its start (red box) and stop codons (blue box) are perfectly conserved, with no intermediate stop codons in any strain. Its Kozak start-codon context (dashed black box) is optimal for ribosomal recognition (A/G in positions −3/+4, green boxes), while context of N is less optimal (A/T in positions −3/+4, orange boxes), with both contexts conserved across *Sarbecovirus* and no intervening AUGs, so ORF9b can be translated by leaky scanning from N’s subgenomic RNA. 
ORF9b has ribosome profiling and proteomics support in SARS-CoV-2, and experimental support in SARS-CoV. Although high synonymous rate in N in central portion of ORF9b is unexpected for a dual coding region, synonymous constraint and near-zero PhyloCSF signal near its ends, and other evidence, suggest it is a conserved functional protein-coding gene, though one with high evolutionary rate in the central portion.
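The Kozak-context comparison in panel c (positions −3 and +4 relative to the A of the start codon) can be expressed as a small check. The three-tier labels below follow a common convention (purine at −3 and G at +4 is "strong", one of the two is "adequate", neither is "weak") rather than a definition given in the caption, and the example contexts are toy strings, not genomic sequence.

```python
def kozak_strength(seq, atg_index):
    """Classify start-codon context from positions -3 and +4 (A of ATG is +1).

    Assumed convention: purine at -3 and G at +4 -> "strong";
    exactly one of the two -> "adequate"; neither -> "weak".
    """
    seq = seq.upper().replace("U", "T")
    if seq[atg_index:atg_index + 3] != "ATG":
        raise ValueError("no ATG at the given index")
    if atg_index < 3 or atg_index + 3 >= len(seq):
        raise ValueError("need 3 nt upstream and 1 nt downstream of the ATG")
    purine_at_minus3 = seq[atg_index - 3] in "AG"
    g_at_plus4 = seq[atg_index + 3] == "G"
    if purine_at_minus3 and g_at_plus4:
        return "strong"
    if purine_at_minus3 or g_at_plus4:
        return "adequate"
    return "weak"

# Toy contexts mirroring the caption: A at -3 with G at +4, versus A at -3 with T at +4.
print(kozak_strength("ACCATGG", 3))  # A / G pattern
print(kozak_strength("ACCATGT", 3))  # A / T pattern
```

A weaker context at an upstream start codon is what makes leaky scanning to a downstream start plausible, which is the argument the caption makes for ORF9b.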

**Fig. 7. Novel gene 3c overlapping 3a is protein-coding.**
a Synonymous constraint elements (blue) match nearly perfectly 41-codon ORF3c dual-coding region boundaries (black), and PhyloCSF protein-coding evolutionary signatures (green) switch between frame 1 and 2 (rows) in the dual-coding region, with frame-2 signal (negative flanking ORF3c) increasing to near-zero, and frame-1 signal (high flanking ORF3c) dropping to near-zero. b, c Codon-resolution evolutionary signatures (colors, CodAlignView) annotating genomic alignment (letters) spanning ORF3a start and dual-coding region, in frame-1 (top) and frame-2 (bottom), highlighting (yellow boxes): (b, frame-2, ORF3c) radical codon substitutions (red) and stop codons (yellow, magenta, cyan) prior to ORF3c start; synonymous (light green) and conservative (dark green) substitutions in ORF3c; ORF3c’s start codon is conserved, except in one strain (row 4) with near-cognate GUG; ORF3c’s stop codon is conserved except for one-codon extension in two strains (rows 2–3); no intermediate stop codons in ORF3c; (c, frame-1, ORF3a) abundant synonymous and conservative substitutions in ORF3a prior to dual-coding region; increase in fully conserved codons (white) over dual-coding region indicating synonymous constraint. Short 61-nucleotide (nt) interval with only one weak-Kozak-context intervening start codon indicates ORF3c may be translated from ORF3a’s subgenomic RNA via leaky scanning.

**Fig. 8. SARS-CoV-2 ORF3b is not protein-coding.**
*Sarbecovirus* alignment of SARS-CoV 154-codon ORF3b overlapping ORF3a (reordered with SARS-CoV and related strains on top). Although the start codon is conserved in all but one strain, ORF length is highly variable due to numerous in-frame stop codons (red ovals and red rectangle). The 22-codon ORF in SARS-CoV-2 has strongly negative PhyloCSF score, does not overlap any SCEs, and even among the four strains sharing its stop codon (blue rectangle) all six substitutions are radical amino acid changes, providing no evidence of amino-acid-level purifying selection. Ribosome profiling did not predict translation of ORF3b, transcription studies did not find substantial transcription of an ORF3b-specific subgenomic RNA, and translation by leaky scanning from the ORF3a subgenomic RNA would implausibly require ribosomal bypass of eight AUG codons (green rectangles, top panel), some with strong Kozak context. (Supplementary Fig. 9 has a comparison to the reading frame of ORF3a).

**Fig. 9. ORF3d is not protein-coding.**
*Sarbecovirus* alignment of 57-codon ORF3d (referred to by some authors as ORF3b) overlapping ORF3a shows mostly function-altering radical amino-acid substitutions (red columns), and repeated interruption by one or more premature stop codons in all other strains (red ovals), unambiguously indicating that ORF3d is not a conserved protein-coding gene. A substantial fraction of SARS-CoV-2 isolates have stop-introducing mutations, and ribosome profiling did not identify ORF3d as a translated ORF, indicating that it is not a recently evolved strain-specific gene either. There is ribosome profiling and other evidence of translation of ORF3d-2, beginning at a downstream AUG and thus avoiding the stop-introducing mutations. However, ORF3d-2 is not conserved, is only 33 codons long, and lacks evidence that its translation product contributes to viral fitness.
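The captions repeatedly distinguish synonymous, amino-acid-changing, and stop-introducing substitutions. A minimal sketch of that classification for a single-nucleotide change within one codon is shown below, using the standard genetic code. The function name `classify_snv` is hypothetical, and the sketch does not separate radical from conservative amino-acid changes, which would require a substitution matrix or similar criterion not specified here.

```python
from itertools import product

# Standard genetic code in TCAG order (NCBI translation table 1).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def classify_snv(codon, pos, alt):
    """Classify a single-nucleotide change at 0-based position `pos` of `codon`.

    Returns (kind, ref_aa, alt_aa) with kind in
    {"synonymous", "missense", "stop-introducing"}.
    """
    codon = codon.upper().replace("U", "T")
    alt = alt.upper().replace("U", "T")
    mutant = codon[:pos] + alt + codon[pos + 1:]
    ref_aa, alt_aa = CODON_TABLE[codon], CODON_TABLE[mutant]
    if ref_aa == alt_aa:
        kind = "synonymous"
    elif alt_aa == "*":
        kind = "stop-introducing"
    else:
        kind = "missense"
    return kind, ref_aa, alt_aa

print(classify_snv("GAT", 1, "G"))  # Asp codon with A-to-G at the second base
print(classify_snv("GAT", 2, "C"))  # third-position change within Asp
print(classify_snv("TAT", 2, "A"))  # Tyr codon mutated to a stop codon
```

Counting these categories column by column across an alignment is the kind of tally that underlies statements such as "mostly radical amino-acid substitutions" or "stop-introducing mutations".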

**Fig. 10. Within-strain variation vs. inter-strain divergence.**
a Gene-level comparison. Long-term inter-strain evolutionary divergence (x-axis) and short-term within-strain variation (y-axis) show strong agreement (linear regression dotted line, Spearman-correlation = 0.70) across mature proteins (crosses, denoting standard error of mean on each axis), indicating that *Sarbecovirus*-clade selective pressures persist in the current pandemic. Well-characterized coronavirus-wide genes (black) show fewer changes in both timescales (bottom left) and less-well-characterized ORFs (blue) show more in both (top right). Significantly deviating exceptions are: nsp3 and S1 (bottom right) showing significantly-fewer amino-acid-changing SNVs than expected from their cross-*Sarbecovirus* rapid evolution, and N (top left), showing significantly-more, possibly due to accelerated evolution in the current pandemic. b Rapidly evolving nucleocapsid region. Top: nucleocapsid-gene context showing B-cell epitope predictions (black, “IEDB Predictions” track), and our annotation track-hub showing: conserved amino acids (red blocks), synonymously constrained codons (green blocks), and SNV classification (colored tick-marks) as conserved/non-conserved (dark/light) and missense/synonymous (red/green); top 3 tracks show AUG codons (green) and stop codons (red) in three frames. Bottom: Focus on 20-amino-acid region R185-G204 (dotted box) in predicted B-cell epitope (black) significantly enriched for amino-acid-changing mutations (red) disrupting perfectly conserved residues, indicative of positive selection in SARS-CoV-2 for immune system avoidance. c Spike D614G evolutionary context. *Sarbecovirus* alignment (text) surrounding spike-protein D614G amino-acid-changing SNV, which rose in frequency in multiple geographic locations suggesting increased transmissibility. 
This A-to-G SNV disrupts a perfectly conserved nucleotide (bold font, A-to-G), which disrupts a perfectly conserved amino-acid (red box, D-to-G), in a perfectly conserved 11-amino-acid region (dotted black box, light-green = synonymous-substitutions) across bat-host sarbecoviruses, suggesting D614G might represent a human-host-adaptive mutation.
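The gene-level agreement reported in panel a is a Spearman rank correlation between inter-strain divergence and within-strain variation. The sketch below shows how such a coefficient can be computed from paired per-gene values; the input lists are toy numbers, not data from the study, and the helper names are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks, assigning tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy per-gene values (hypothetical): divergence vs. within-strain variation.
divergence = [0.1, 0.2, 0.3, 0.4, 0.5]
variation = [5, 6, 7, 8, 7]
print(round(spearman(divergence, variation), 3))
```

Genes that fall far from the overall trend in such a comparison, as the caption reports for nsp3, S1, and N, are the ones flagged as deviating from long-term selective pressures.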


Update of

SARS-CoV-2 gene content and COVID-19 mutation impact by comparing 44 Sarbecovirus genomes.
Jungreis I, Sealfon R, Kellis M. bioRxiv [Preprint]. 2020 Sep 2:2020.06.02.130955. doi: 10.1101/2020.06.02.130955. Update in: Nat Commun. 2021 May 11;12(1):2642. doi: 10.1038/s41467-021-22905-7. PMID: 32577641. Preprint.
SARS-CoV-2 gene content and COVID-19 mutation impact by comparing 44 Sarbecovirus genomes.
Jungreis I, Sealfon R, Kellis M. Res Sq [Preprint]. 2020 Oct 1:rs.3.rs-80345. doi: 10.21203/rs.3.rs-80345/v1. Update in: Nat Commun. 2021 May 11;12(1):2642. doi: 10.1038/s41467-021-22905-7. PMID: 33024961. Preprint.

