Sci Rep. 2021 Feb 10;11(1):3487.
doi: 10.1038/s41598-021-83105-3.

Genomic mutations and changes in protein secondary structure and solvent accessibility of SARS-CoV-2 (COVID-19 virus)

Thanh Thi Nguyen et al. Sci Rep. 2021.

Abstract

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a highly pathogenic virus that has caused the global COVID-19 pandemic. Tracing the evolution and transmission of the virus is crucial for responding to and controlling the pandemic through appropriate intervention strategies. This paper reports and analyses genomic mutations in the coding regions of SARS-CoV-2 and the probable changes they cause in protein secondary structure and solvent accessibility, which are predicted using deep learning models. The predictions suggest that mutation D614G in the virus spike protein, which has attracted much attention from researchers, is unlikely to change protein secondary structure or relative solvent accessibility. Based on 6324 viral genome sequences, we create a spreadsheet dataset of point mutations that can facilitate the investigation of SARS-CoV-2 from many perspectives, especially in tracing the evolution and worldwide spread of the virus. Our analysis also shows that coding genes E, M, ORF6, ORF7a, ORF7b and ORF10 are the most stable and are therefore potentially suitable targets for vaccine and drug development.

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
Protein-coding genes of SARS-CoV-2, which consist of 7 nonstructural genes (ORF1ab, ORF3a, ORF6, ORF7a, ORF7b, ORF8 and ORF10) and 4 structural genes (S, E, M and N). The ORF1ab polyprotein is coded by gene ORF1ab at locations 266–21555 (based on the reference genome sequence NC_045512), the surface glycoprotein by gene S (21563–25384), the ORF3a protein by gene ORF3a (25393–26220), the envelope protein by gene E (26245–26472), the membrane glycoprotein by gene M (26523–27191), the ORF6 protein by gene ORF6 (27202–27387), the ORF7a protein by gene ORF7a (27394–27759), the ORF7b protein by gene ORF7b (27756–27887), the ORF8 protein by gene ORF8 (27894–28259), the nucleocapsid phosphoprotein by gene N (28274–29533), and the ORF10 protein by gene ORF10 (29558–29674).
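The coordinates in this caption can be checked with a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, and it assumes the coordinates are 1-based and inclusive, as is usual for GenBank annotations. It computes gene lengths in nucleotides, reports neighbouring genes whose ranges overlap, and looks up which gene or genes cover a given genome position.

```python
# Gene coordinates copied from the Figure 1 caption (reference NC_045512).
# Assumption: positions are 1-based and inclusive.
GENES = {
    "ORF1ab": (266, 21555),
    "S": (21563, 25384),
    "ORF3a": (25393, 26220),
    "E": (26245, 26472),
    "M": (26523, 27191),
    "ORF6": (27202, 27387),
    "ORF7a": (27394, 27759),
    "ORF7b": (27756, 27887),
    "ORF8": (27894, 28259),
    "N": (28274, 29533),
    "ORF10": (29558, 29674),
}


def gene_length(name):
    """Length of a gene in nucleotides (inclusive range)."""
    start, end = GENES[name]
    return end - start + 1


def overlapping_pairs():
    """Return (gene_a, gene_b, shared_nt) for neighbouring genes that overlap."""
    ordered = sorted(GENES.items(), key=lambda kv: kv[1][0])
    pairs = []
    for (a, (_, a_end)), (b, (b_start, _)) in zip(ordered, ordered[1:]):
        shared = a_end - b_start + 1
        if shared > 0:
            pairs.append((a, b, shared))
    return pairs


def gene_at(position):
    """Names of all genes whose range contains the genome position."""
    return [name for name, (s, e) in GENES.items() if s <= position <= e]


if __name__ == "__main__":
    for name in GENES:
        print(name, gene_length(name))
    print(overlapping_pairs())
```

With the caption's numbers, the only neighbouring genes that overlap are ORF7a and ORF7b, which share four nucleotides (27756–27759).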
Figure 2
Protein ORF1ab: the number of insertion, deletion and nonsynonymous mutations at each location in the protein. Spikes occur at T265I (1344), L3606F (271), P4715L (2576), P5828L (475) and Y5865C (476). Regions between these spikes are stable and could be targeted for vaccine and drug development.
Figure 3
Protein S: the number of insertion, deletion and nonsynonymous mutations at each location in the protein. There is a spike at D614G (3089), while other regions of the protein are stable.
Figure 4
The number of insertion, deletion and nonsynonymous mutations at each location in the proteins ORF3a, E, M, ORF6, ORF7a, ORF7b, ORF8, N and ORF10. The numbers below the protein names are the lengths of the proteins. Protein ORF3a has two spikes, at Q57H (2795) and G251V (206); protein ORF8 has two spikes, at S24L (320) and L84S (1000); protein N has two spikes, at R203K (876) and G204R (433). The other proteins (E, M, ORF6, ORF7a, ORF7b and ORF10) are almost entirely stable.
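To put the spike counts in Figures 2 to 4 on a common scale, they can be expressed as fractions of the dataset. The sketch below is a rough illustration rather than the authors' analysis: it assumes that each count is the number of genomes carrying the mutation and that the denominator is the 6324 genome sequences mentioned in the abstract. The variable and function names are hypothetical.

```python
# Assumption: the denominator is the 6324 genomes stated in the abstract.
TOTAL_GENOMES = 6324

# Counts copied from the captions of Figures 2-4.
SPIKE_COUNTS = {
    ("ORF1ab", "T265I"): 1344,
    ("ORF1ab", "L3606F"): 271,
    ("ORF1ab", "P4715L"): 2576,
    ("ORF1ab", "P5828L"): 475,
    ("ORF1ab", "Y5865C"): 476,
    ("S", "D614G"): 3089,
    ("ORF3a", "Q57H"): 2795,
    ("ORF3a", "G251V"): 206,
    ("ORF8", "S24L"): 320,
    ("ORF8", "L84S"): 1000,
    ("N", "R203K"): 876,
    ("N", "G204R"): 433,
}


def fraction(count, total=TOTAL_GENOMES):
    """Share of genomes carrying a mutation, under the stated assumption."""
    return count / total


def ranked():
    """Mutations sorted from most to least frequent."""
    return sorted(SPIKE_COUNTS.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    for (protein, mutation), count in ranked():
        print(f"{protein:7s}{mutation:8s}{count:5d}  {fraction(count):6.1%}")
```

Under this assumption, D614G is the most frequent of the listed mutations, present in a little under half of the genomes.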
Figure 5
Alignment of sequences with deletion mutations at positions M85-, V86- or K141-, S142-, F143-, which are the major deletions in the ORF1ab protein (Table 2). GenBank accession numbers are shown on the left, and isolate names and collection dates on the right. The numbers on top show the positions of amino acids (AAs) in the protein, and isolates are ordered by collection date. The first isolate with these deletions is USA-CA6/2020 (record MT044258, second row), collected on 2020-01-27 in the USA (CA). It is also the isolate with the largest number of deletions: five consecutive deletions at G82-, H83-, V84-, M85-, V86- and three at K141-, S142-, F143-. The patients in the following rows were possibly infected by this first case, but more data, such as travel history, are needed to confirm this hypothesis.
Figure 6
Insertion and deletion mutations in protein ORF6. The GenBank accession numbers, collection dates and isolate names are presented on the left. The nonsynonymous mutation D61K and the two insertions −62R and −63T at the end of isolate USA/MA_MGH_00184/2020 (MT520188) are notable, and there is a high chance that HKG/VM20001061/2020 has spread to USA/VA-DCLS-0294/2020.
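The captions use a compact label for each protein-level mutation: reference residue, position, then observed residue, with a dash standing for a gap. A label such as M85- is therefore a deletion, a label with a leading dash and a trailing residue (62R preceded by a dash) is an insertion, and D61K is a substitution of one amino acid for another. A minimal parser for this notation might look like the following. The function name and the returned tuple layout are hypothetical, and the regular expression is an assumption about the notation rather than a documented grammar.

```python
import re

# Gap symbol may appear as an ASCII hyphen or a Unicode minus sign.
GAPS = {"-", "\u2212"}

# reference residue (or gap), optional space, position, observed residue (or gap)
_LABEL = re.compile(r"^([A-Z\-\u2212])\s*(\d+)([A-Z\-\u2212])$")


def parse_label(label):
    """Classify a mutation label and return (kind, position, ref, alt)."""
    match = _LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"unrecognised mutation label: {label!r}")
    ref, pos, alt = match.group(1), int(match.group(2)), match.group(3)
    ref_is_gap, alt_is_gap = ref in GAPS, alt in GAPS
    if ref_is_gap and alt_is_gap:
        raise ValueError(f"label has a gap on both sides: {label!r}")
    if ref_is_gap:
        return ("insertion", pos, "-", alt)
    if alt_is_gap:
        return ("deletion", pos, ref, "-")
    if ref == alt:
        return ("unchanged", pos, ref, alt)
    return ("substitution", pos, ref, alt)


if __name__ == "__main__":
    for text in ["M85-", "D61K", "-62R", "D614G"]:
        print(text, parse_label(text))
```

Gap symbols are normalised to an ASCII hyphen in the result so that downstream code only has to handle one form.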
Figure 7
Deletions in protein ORF7a, with the GenBank accession numbers, collection dates and isolate names presented on the left. The run of 14 consecutive deletions in the isolate USA/VI-CDC-3884/2020 (MT507795) is worth further study, as the patient's clinical data may differ from those of other COVID-19 patients.
Figure 8
Alignment of protein ORF8 sequences with insertions, with the GenBank accession numbers, isolate names and collection dates presented on the left. There is a high chance that the isolate CHN/GZMU0042/2020 (MT568638, collected in China) was transmitted to USA/FL-BPHL-0059/2020 (MT507032, collected in the USA).
Figure 9
Deletions in protein S. All of them fall outside the receptor-binding domain (RBD, residues 319–541), suggesting that the RBD may have been evolutionarily optimized for binding to a host cell. The numbers on top show the residue positions in the protein. The GenBank accession numbers, collection dates and isolate names are presented on the left.
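The claim in this caption amounts to an interval check: no deleted residue position lies in the RBD range 319–541. A minimal sketch of that check follows. The RBD bounds come from the caption, while the function names and the example positions are made up for illustration and are not taken from the paper's data.

```python
# RBD residue range as given in the Figure 9 caption (inclusive).
RBD_START, RBD_END = 319, 541


def in_rbd(position):
    """True if a spike-protein residue position lies inside the RBD."""
    return RBD_START <= position <= RBD_END


def split_by_rbd(positions):
    """Partition residue positions into (inside_rbd, outside_rbd), each sorted."""
    inside = sorted(p for p in positions if in_rbd(p))
    outside = sorted(p for p in positions if not in_rbd(p))
    return inside, outside


if __name__ == "__main__":
    # Hypothetical positions chosen to exercise the boundaries,
    # NOT deletion positions reported in the paper.
    example_positions = [10, 318, 319, 541, 542, 1200]
    inside, outside = split_by_rbd(example_positions)
    print("inside RBD: ", inside)
    print("outside RBD:", outside)
```

Applied to the real deletion positions from the alignment, an empty `inside` list would correspond to the caption's observation.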
