Genet Epidemiol. 2021 Apr;45(3):316-323.
doi: 10.1002/gepi.22373. Epub 2021 Jan 8.

Unsupervised cluster analysis of SARS-CoV-2 genomes reflects its geographic progression and identifies distinct genetic subgroups of SARS-CoV-2 virus

Georg Hahn et al. Genet Epidemiol. 2021 Apr.

Abstract

Over 10,000 viral genome sequences of the SARS-CoV-2 virus have been made readily available during the ongoing coronavirus pandemic since the initial genome sequence of the virus was released on the open access Virological website (http://virological.org/) early on January 11. We utilize the published data on the single-stranded RNAs of 11,132 SARS-CoV-2 patients in the GISAID database, which contains fully or partially sequenced SARS-CoV-2 samples from laboratories around the world. Among the many important research questions currently being investigated, one pertains to the genetic characterization/classification of the virus. We analyze the nucleotide sequences and geographic information of the subset of 7,640 SARS-CoV-2 patients in the GISAID database that have no missing entries. Instead of modeling the mutation rate, applying phylogenetic tree approaches, and so forth, we utilize a model-free clustering approach that compares the viruses at a genome-wide level. We apply principal component analysis to a similarity matrix that compares all pairs of these SARS-CoV-2 nucleotide sequences at all loci simultaneously, using the Jaccard index. The results of our analysis of the SARS-CoV-2 genome data illustrate the geographic and chronological progression of the virus, starting from the first cases that were observed in China to the current wave of cases in Europe and North America. This is in line with a phylogenetic analysis which we use to contrast our results. We also observe that, based on their sequence data, the SARS-CoV-2 viruses cluster in distinct genetic subgroups. It is the subject of ongoing research to examine whether the genetic subgroup could be related to disease outcome and what its potential implications for vaccine development are.

Keywords: SARS-CoV-2; clustering; covid; jaccard.
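
The clustering step described in the abstract (a Jaccard similarity matrix over all pairs of sequences, followed by principal component analysis) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each genome has already been encoded as a binary vector marking the loci where it differs from a reference sequence, and it takes the leading eigenvectors of the uncentered similarity matrix as the principal components; the encoding, the absence of centering, the function names, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def jaccard_similarity_matrix(X):
    """Pairwise Jaccard index between the rows of a binary matrix.

    X[i, j] is 1 if sample i differs from the reference at locus j, else 0
    (an assumed encoding, not taken from the paper).
    """
    X = np.asarray(X, dtype=np.int64)
    intersection = X @ X.T                      # shared mismatches per pair
    row_sums = X.sum(axis=1)                    # mismatches per sample
    union = row_sums[:, None] + row_sums[None, :] - intersection
    # Assumed convention: two samples without any mismatch get similarity 1.
    return np.where(union > 0, intersection / np.maximum(union, 1), 1.0)


def top_principal_components(S, k=2):
    """Eigenvectors of the symmetric matrix S for its k largest eigenvalues."""
    eigenvalues, eigenvectors = np.linalg.eigh(S)
    order = np.argsort(eigenvalues)[::-1][:k]   # largest eigenvalues first
    return eigenvectors[:, order]


# Hypothetical toy data: 4 samples, 4 loci.
toy = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
])
S = jaccard_similarity_matrix(toy)
components = top_principal_components(S, k=2)   # one row per sample
```

Plotting the two columns of `components` against each other, colored by region, would give a picture analogous in spirit to the scatter plot described under Figure 1.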

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1:
First two principal components of the Jaccard similarity matrix for the 7,640 SARS-CoV-2 genomes by region/country. Entire dataset (left) and zoomed-in region around the origin (0, 0) (right). Numbers in brackets for each country denote the number of SARS-CoV-2 genomes which are visible in each plot.
Figure 2:
The x-axis shows the roughly 29,000 nucleotides of the trimmed SARS-CoV-2 reference sequence in 50 bins. The y-axis shows, per bin, the normalized number of mismatches (with respect to the reference sequence) among the samples in the European and North American population, stratified into samples from the top branch (red), the second branch (green), the middle branch (blue), and the bottom branch (orange) visible in the left panel of Fig. 1. The normalization is done with respect to both the bin size and the number of samples in each branch.
Figure 3:
Phylogenetic analysis shown as a tree, computed on https://www.gisaid.org/epiflu-applications/influenza-phylogenetics/ using MAFFT for sequence alignment and FastTree for building the phylogenetic tree.
Figure 4:
Phylogenetic analysis shown in radial representation, computed on https://www.gisaid.org/epiflu-applications/influenza-phylogenetics/ using MAFFT for sequence alignment and FastTree for building the phylogenetic tree.
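
The normalization described in the Figure 2 caption (mismatches per bin, scaled by bin size and by the number of samples in a branch) might be computed roughly as follows. This is a sketch under assumptions: the input is taken to be a binary mismatch matrix for the samples of one branch, the bins are taken to be contiguous and of near-equal size, and the function name and toy data are hypothetical.

```python
import numpy as np


def binned_mismatch_rate(mismatch, n_bins=50):
    """Normalized mismatch count per bin for one group of samples.

    mismatch: (n_samples, n_loci) binary matrix, 1 where a sample differs
    from the reference sequence (assumed encoding).
    Returns an array of length n_bins with
    (mismatches in bin) / (bin size * number of samples).
    """
    mismatch = np.asarray(mismatch)
    n_samples, n_loci = mismatch.shape
    if n_loci < n_bins:
        raise ValueError("need at least one locus per bin")
    per_locus = mismatch.sum(axis=0)              # mismatches at each locus
    bins = np.array_split(per_locus, n_bins)      # contiguous, near-equal bins
    return np.array([b.sum() / (len(b) * n_samples) for b in bins])


# Hypothetical toy data: 2 samples, 4 loci, 2 bins.
toy_branch = np.array([
    [1, 0, 0, 0],
    [1, 1, 0, 0],
])
rates = binned_mismatch_rate(toy_branch, n_bins=2)
```

Applying the function separately to the samples of each branch would yield one curve per branch, comparable across branches because each is scaled by its own sample count.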
