Sci Rep. 2021 Nov 26;11(1):23023.

doi: 10.1038/s41598-021-02548-w.

Epidemiological associations with genomic variation in SARS-CoV-2

Ali Rahnavard¹, Tyson Dawson², Rebecca Clement², Nathaniel Stearrett², Marcos Pérez-Losada²,³, Keith A Crandall²

Affiliations

¹ Computational Biology Institute, Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, The George Washington University, Washington, DC, USA. rahnavard@gwu.edu.
² Computational Biology Institute, Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, The George Washington University, Washington, DC, USA.
³ CIBIO-InBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, Universidade do Porto, Campus Agrário de Vairão, Vairão, Portugal.

PMID: 34837008
PMCID: PMC8626494
DOI: 10.1038/s41598-021-02548-w

Ali Rahnavard et al. Sci Rep. 2021.

Abstract

SARS-CoV-2 (CoV) is the etiological agent of the COVID-19 pandemic and evolves to evade both host immune systems and intervention strategies. We divided the CoV genome into 29 constituent regions and applied novel analytical approaches to identify associations between CoV genomic features and epidemiological metadata. Our results show that nonstructural protein 3 (nsp3) and Spike protein (S) have the highest variation and greatest correlation with the viral whole-genome variation. S protein variation is correlated with nsp3, nsp6, and 3'-to-5' exonuclease variation. Country of origin and time since the start of the pandemic were the most influential metadata associated with genomic variation, while host sex and age were the least influential. We define a novel statistic, coherence, and show its utility in identifying geographic regions (populations) with unusually high (many new variants) or low (isolated) viral phylogenetic diversity. Interestingly, at both global and regional scales, we identify geographic locations with high coherence neighboring regions of low coherence; this emphasizes the utility of this metric to inform public health measures for disease spread. Our results provide a direction to prioritize genes associated with outcome predictors (e.g., health, therapeutic, and vaccine outcomes) and to improve DNA tests for predicting disease status.

Conflict of interest statement

The authors declare no competing interests.

Figures

**Figure 1**
Maximum Likelihood analysis of the nsp3 region of the CoV genomes. (a) RAxML cladogram (branch lengths not proportional to change) showing relationships among SARS-CoV-2, MERS, Bat-SL-CoV, and SARS-related lineages, rooted using a Beta Coronavirus outgroup. Sequence identity estimates between the representatives of clades for CoV families reveal regions with potential functional importance. (b) RAxML phylogram (branch lengths proportional to change) estimated from 2,007 sequences from the GISAID database, including proportional representatives of genomes from Pangolin major clades.

**Figure 2**
Subclade identification using CoV genome and gene variation in the sample population of our study. Subclades were identified using *omeClust*, and the enrichment score of metadata was measured from the overlap between detected clades and metadata using normalized mutual information (NMI). (a) Regions of the CoV genome were clustered using z-scores of enrichment scores for three metadata variables available for all lineages. Regions such as S, nsp6, N, nsp3, ORF1a, and ORF1ab are more similar to genomes, based on clusters of scaled enrichment scores. (b) *omeClust* identifies communities of CoV lineages that are mostly explained by organism (NMI = 0.9). (c) The Spike protein, which facilitates binding to and entry into host cells, carries variation among organisms similar to that of the whole CoV genome. (d) The nsp3 protein shows variation similar to that of the S protein and can be targeted as a protein with an important biological function. *omeClust* detects four communities (point colors) corresponding to the four known organisms (point shapes).
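
The NMI-based enrichment score described in this caption can be illustrated with a minimal Python sketch. It computes normalized mutual information between a list of cluster labels and a list of metadata labels. The function names, the toy labels, and the choice of normalizing by the geometric mean of the two entropies are assumptions made for illustration; the normalization used by *omeClust* is not stated here and may differ.

```python
from collections import Counter
from math import log, sqrt


def entropy(labels):
    """Shannon entropy (in nats) of a list of categorical labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(clusters, metadata):
    """NMI between two labelings of the same samples.

    Assumption: mutual information is divided by the geometric mean of
    the two entropies; other normalizations exist.
    """
    n = len(clusters)
    joint = Counter(zip(clusters, metadata))
    count_c = Counter(clusters)
    count_m = Counter(metadata)
    mi = 0.0
    for (c, m), n_cm in joint.items():
        mi += (n_cm / n) * log(n_cm * n / (count_c[c] * count_m[m]))
    h_c, h_m = entropy(clusters), entropy(metadata)
    if h_c == 0.0 or h_m == 0.0:
        return 0.0  # a constant labeling carries no information
    return mi / sqrt(h_c * h_m)


# Hypothetical toy data: four samples, two clusters, one metadata variable.
toy_clusters = [0, 0, 1, 1]
toy_organism = ["bat", "bat", "human", "human"]
print(normalized_mutual_information(toy_clusters, toy_organism))
```

A score near 1 means the detected clusters line up with the metadata categories, and a score near 0 means they are unrelated.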

**Figure 3**
Associations among SARS-CoV-2 genome and gene variation. (a) The SARS-CoV-2 genome and 29 specific regions are used to structure dissimilarity among samples in the GISAID cohort. Relationships between variation explained among proteins, regions, and the whole genome of CoV using paired measurements with differences across subjects are quantified by Mantel tests (square of the Mantel statistic, see “Methods”). (b) The selection proportion (histogram bars) and the number of sites under selection (values above the bars) for each of the 21 specific regions detected by HyPhy on October 28, 2020. Spike protein and nsp3 are among the regions with a high number of sites under selection, while the nsp10 and ORF6 regions have the lowest number of sites. The RNA-dependent RNA polymerase (RdRP) has the highest selection proportion in the HyPhy analysis (the number of sites under selection divided by the length of the gene region), but shows no association in our analyses. (c) The count of SARS-CoV-2 SNPs (in log scale) shows distinct patterns across genome regions. The 3′-to-5′ exonuclease, endoRNAse, 2′-O-Ribose methyltransferase, and Spike proteins have a heightened number of mutations. The red line is an arbitrary cutoff at log(8000), drawn to emphasize large differences because the results are shown on a log scale.
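
The Mantel test mentioned in panel (a) can be sketched as follows. This is a minimal Python illustration, not the study's pipeline: it correlates the upper triangles of two distance matrices (Pearson correlation), reports the squared statistic named in the caption, and estimates a one-sided permutation p-value by relabeling the samples of one matrix. The function names, the toy matrices, and the number of permutations are assumptions.

```python
import random
from math import sqrt


def upper_triangle(matrix):
    """Flatten the strict upper triangle of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mantel(dist_a, dist_b, permutations=999, seed=0):
    """Return (r, r squared, one-sided permutation p-value)."""
    rng = random.Random(seed)
    n = len(dist_a)
    flat_b = upper_triangle(dist_b)
    observed = pearson(upper_triangle(dist_a), flat_b)
    order = list(range(n))
    hits = 0
    for _ in range(permutations):
        rng.shuffle(order)
        # Permute rows and columns of dist_a together (relabel samples).
        permuted = [[dist_a[order[i]][order[j]] for j in range(n)]
                    for i in range(n)]
        if pearson(upper_triangle(permuted), flat_b) >= observed - 1e-12:
            hits += 1
    p_value = (hits + 1) / (permutations + 1)
    return observed, observed ** 2, p_value


# Hypothetical toy data: distances among four samples placed on a line,
# and a second matrix that is an exact rescaling of the first.
points = [0, 1, 3, 6]
genome_dist = [[abs(a - b) for b in points] for a in points]
region_dist = [[2 * d for d in row] for row in genome_dist]
r, r_squared, p = mantel(genome_dist, region_dist)
print(r, r_squared, p)
```

Permuting rows and columns jointly, rather than shuffling individual distances, keeps each permuted matrix a valid distance matrix over the same samples.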

**Figure 4**
Association between SARS-CoV-2 genome regions and metadata. GTR + G-based distances among CoV genomes and 29 specific regions were used to assess relationships between variation explained among proteins, regions, and the whole genome of SARS-CoV-2, using paired measurements with differences across subjects, by an omnibus (PERMANOVA) test. White cells indicate cases where there was not sufficient variation to perform our analyses.
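
For readers unfamiliar with PERMANOVA, the following Python sketch shows the standard distance-based pseudo-F statistic with a label-permutation p-value for a single grouping variable. It is a simplified illustration under stated assumptions: the toy distances are absolute differences between numbers rather than GTR + G distances, and the function names, group labels, and permutation count are hypothetical.

```python
import random
from collections import defaultdict


def pseudo_f(dist, groups):
    """Pseudo-F and R squared from a distance matrix and group labels."""
    n = len(dist)
    members = defaultdict(list)
    for idx, g in enumerate(groups):
        members[g].append(idx)
    # Total sum of squares from all pairwise squared distances.
    ss_total = sum(dist[i][j] ** 2
                   for i in range(n) for j in range(i + 1, n)) / n
    # Within-group sum of squares, one term per group.
    ss_within = 0.0
    for idxs in members.values():
        ss_within += sum(dist[i][j] ** 2
                         for k, i in enumerate(idxs)
                         for j in idxs[k + 1:]) / len(idxs)
    ss_among = ss_total - ss_within
    a = len(members)
    f_stat = (ss_among / (a - 1)) / (ss_within / (n - a))
    return f_stat, ss_among / ss_total


def permanova(dist, groups, permutations=999, seed=0):
    """Return (pseudo-F, R squared, permutation p-value)."""
    rng = random.Random(seed)
    f_obs, r_squared = pseudo_f(dist, groups)
    labels = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(labels)  # break any link between labels and distances
        if pseudo_f(dist, labels)[0] >= f_obs - 1e-9:
            hits += 1
    return f_obs, r_squared, (hits + 1) / (permutations + 1)


# Hypothetical toy data: six samples on a line, two well-separated groups.
values = [0, 1, 2, 10, 11, 12]
toy_dist = [[abs(a - b) for b in values] for a in values]
toy_country = ["A", "A", "A", "B", "B", "B"]
print(permanova(toy_dist, toy_country))
```

R squared here is the share of total distance-based variation attributed to the grouping variable.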

**Figure 5**
Quantification of coherence of lineages within a specified area compared to other areas. Higher coherence values indicate lower phylogenetic distance within a specific geographic region relative to other areas. (a) 15,721 viral genome sequences from infected individuals were downloaded from GISAID on May 8th, 2020, and the sequencing data were aligned and used to compare the diversity of SARS-CoV-2 within countries to that in the rest of the world. (b) Samples from each US state were compared to the rest of the US to investigate the similarity of the virus lineages within each state. Several countries and states exhibited differentiation into specific clades. States or countries with darker colors likely show a higher level of community-driven spread. In contrast, states or countries with lower coherence (lighter colors) show a greater level of disease introduction from outside the region. The figure is implemented in *omicsArt*, a ggplot2-based R package.
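
The caption describes coherence only qualitatively, so the exact formula is not reproduced here. The Python sketch below is not the authors' statistic; it is a hypothetical score that merely behaves the way the caption describes, rising when distances inside a region are small relative to distances between that region and everywhere else. The name `coherence_like_score`, the formula (one minus the ratio of mean within-region distance to mean region-to-outside distance), and the toy data are all assumptions.

```python
from statistics import mean


def coherence_like_score(dist, regions, target):
    """Illustrative score: 1 - mean(within) / mean(between).

    Assumption: this formula is a stand-in chosen for illustration and
    is not taken from the paper.
    """
    inside = [i for i, r in enumerate(regions) if r == target]
    outside = [i for i, r in enumerate(regions) if r != target]
    if len(inside) < 2 or not outside:
        raise ValueError("need at least two samples inside and one outside")
    within = [dist[i][j] for k, i in enumerate(inside) for j in inside[k + 1:]]
    between = [dist[i][j] for i in inside for j in outside]
    return 1.0 - mean(within) / mean(between)


# Hypothetical toy data: six samples on a line standing in for
# phylogenetic distances.
values = [0, 1, 2, 10, 11, 12]
toy_dist = [[abs(a - b) for b in values] for a in values]

# Region X holds three closely related samples.
clustered = coherence_like_score(toy_dist, ["X", "X", "X", "Y", "Y", "Y"], "X")
# Region X holds samples scattered across both lineages.
mixed = coherence_like_score(toy_dist, ["X", "Y", "X", "Y", "X", "Y"], "X")
print(clustered, mixed)
```

In this toy setting the clustered region scores higher than the mixed one, mirroring the caption's reading of darker colors as community-driven spread and lighter colors as repeated introductions from outside.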

**Figure 6**
Predicted protein structure from sequence data across coronavirus families. (a) Proteins with high variation among coronaviruses tend to have different protein structures. Blank cells indicate proteins that could not be successfully modeled by SWISS-MODEL. (b) Amino acid composition in predicted secondary structures of proteins shows different patterns among CoV genome proteins. Gray cells refer to proteins that contained stop codons in our alignment or were otherwise not amenable to structural analysis.

Grants and funding

2028280/National Science Foundation
