PLoS One. 2020 Nov 10;15(11):e0240345.
doi: 10.1371/journal.pone.0240345. eCollection 2020.

Large scale genomic analysis of 3067 SARS-CoV-2 genomes reveals a clonal geo-distribution and a rich genetic variations of hotspots mutations

Meriem Laamarti et al. PLoS One.

Abstract

In late December 2019, an emerging viral infection, COVID-19, was identified in Wuhan, China, and became a global pandemic. Characterizing the genetic variants of SARS-CoV-2 is crucial for following and evaluating its spread across countries. In this study, we collected and analyzed 3,067 SARS-CoV-2 genomes isolated from 55 countries during the first three months after the emergence of the virus. Using comparative genomic analysis, we traced the whole-genome mutation profiles and compared the frequency of each mutation in the studied population. We also monitored the accumulation of mutations during the epidemic period together with their geographic locations. The results showed 782 variant sites, of which 512 (65.47%) had a non-synonymous effect. The frequencies of mutated alleles revealed 68 recurrent mutations, including ten hotspot non-synonymous mutations with a prevalence higher than 0.10 in this population, distributed across six SARS-CoV-2 genes. The distribution of these recurrent mutations on the world map revealed that certain genotypes are specific to geographic locations. We also identified co-occurring mutations that give rise to several haplotypes. Moreover, evolution over time showed a pattern of mutation co-accumulation that might affect the severity and spread of SARS-CoV-2. The phylogenetic analysis identified two major clades, C1 and C2, harboring mutations L3606F and G614D, respectively, both of which emerged for the first time in China. In addition, analysis of the selective pressure revealed negatively selected residues that could be considered as therapeutic targets. We have also created an inclusive unified database (http://covid-19.medbiotech.ma) that lists all of the genetic variants of the SARS-CoV-2 genomes found in this study, with phylogeographic analysis around the world.
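The frequency and hotspot figures in the abstract can be illustrated with a minimal sketch. It assumes each genome has already been reduced to a set of mutation labels; the function names, the labels `mutA`, `mutB`, `mutC`, and the toy data are hypothetical and are not taken from the study. Only the 0.10 prevalence cutoff and the counts 512 and 782 come from the abstract.

```python
from collections import Counter


def mutation_frequencies(genome_mutations):
    """Return {mutation: fraction of genomes that carry it}.

    genome_mutations is a list with one collection of mutation
    labels per genome (hypothetical input format).
    """
    n_genomes = len(genome_mutations)
    counts = Counter()
    for mutations in genome_mutations:
        # set() so a mutation is counted at most once per genome
        counts.update(set(mutations))
    return {mutation: count / n_genomes for mutation, count in counts.items()}


def hotspots(frequencies, threshold=0.10):
    """Mutations whose prevalence is strictly higher than the threshold."""
    return sorted(m for m, f in frequencies.items() if f > threshold)


# Toy data: 10 hypothetical genomes, not real sequences.
toy_genomes = (
    [{"mutA"}] * 4            # 4 genomes with mutA only
    + [{"mutA", "mutC"}] * 2  # 2 genomes with mutA and mutC
    + [{"mutB"}]              # 1 genome with mutB only
    + [set()] * 3             # 3 genomes identical to the reference
)

frequencies = mutation_frequencies(toy_genomes)
hotspot_list = hotspots(frequencies)

# Share of non-synonymous variant sites reported in the abstract.
nonsynonymous_share = round(100 * 512 / 782, 2)
```

In this toy population `mutB` sits exactly at 0.10 and is therefore not flagged, since the abstract describes hotspots as having a prevalence higher than 0.10.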

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Distribution of the 3,067 genomes used in this study by country and date of isolation.
A) The pie chart shows the percentage of genomes used in this study according to their geographic origin. The colors indicate different countries. B) Number of complete genomes, distributed over a period of 3 months from the end of December to the end of March.
Fig 2. Schematic representation illustrating the distribution of recurrent non-synonymous mutations along the SARS-CoV-2 genome.
The brown and garnet diagrams illustrate the non-structural proteins (nsp1 to nsp16) of the orf1ab protein and the two subunits of the spike protein, respectively. Recurrent mutations are represented by vertical lines. The frequency of each mutation in the population is shown by color-coded circles. Abbreviations: S, spike; E, envelope; M, membrane protein; N, nucleocapsid protein; CT, cytoplasmic tail.
Fig 3. Map showing the geographical distribution of recurrent mutations in the studied population worldwide.
The pie charts show the relative frequencies of haplotypes for each population. The haplotypes are color-coded as shown in the key. The two-letter labels are country codes. The circle sizes were randomly generated and are not associated with the number of genomes in each country. Abbreviations: S, spike; E, envelope; M, membrane protein; N, nucleocapsid protein.
Fig 4. Accumulation of substitutions over a three-month period.
A) The accumulation of mutations increases linearly with time. The dots represent the number of mutations in each genome. All substitutions have been included: non-synonymous, synonymous and intergenic mutations. B) The distribution and accumulation of hotspot mutations over time.
Fig 5. Phylogenetic analysis of 3,067 SARS-CoV-2 genomes grouped according to the country of origin.
The length of the branches represents the distance in time.
Fig 6. Heatmap showing the correlation between mutations and the geographic distribution of the genomes analyzed.
The correlation was computed on a data set of the 68 most recurrent mutations, which are distributed differently across the 55 countries; the countries are divided into two distinct clusters, A and B. The color scale indicates the significance of the correlation, with blue and orange indicating the highest and lowest correlation, respectively. The red, yellow and orange colors in the horizontal bar represent the continent of origin. Abbreviations: S, spike; M, membrane protein; N, nucleocapsid protein.
Fig 7. Structural view of selective pressure in orf1ab gene.
Residues under positive and negative selection are highlighted in blue and red, respectively. The models of the orf1ab non-structural proteins (nsp3, nsp4, nsp6, nsp12, nsp13, nsp14, and nsp16) harboring residues under selective pressure were produced using CI-TASSER. A. The nsp3 domains MAC1, Ubl1, Ubl2-PLpro, and SUD-C are color-coded in the 3D representation. The residues Ile-1426 and Ala-655, under negative selection, are located on the 3Eco and SUD-C domains, respectively, while the Thr-353 residue, under positive selection, is shown on the MAC1 domain. Likewise, panels B, C, D, E, F, and G illustrate 3D representations of the nsp4, nsp6, nsp12, nsp13, nsp14 and nsp16 proteins, respectively.
Fig 8. Structural view of selective pressure in spike gene.
The negatively selected site in the spike protein is highlighted in red. The only negatively selected amino acid residue in the receptor-binding domain is Gln-474. The cryo-EM structure with PDB ID 6VSB was used as a model for the spike protein in its prefusion conformation.
Fig 9. Pangenome construction of different strains belonging to the genus Betacoronavirus.
A. The Venn diagram represents the shared and unique proteins of SARS-CoV-2 compared to the 16 species of the genus Betacoronavirus. B. The pie chart shows the core proteins (present in all strains) and accessory proteins (not present in all strains) at the intragenomic scale of SARS-CoV-2.

