Unsupervised cluster analysis of SARS-CoV-2 genomes reflects its geographic progression and identifies distinct genetic subgroups of SARS-CoV-2 virus
- PMID: 33415739
- PMCID: PMC8005425
- DOI: 10.1002/gepi.22373
Abstract
Over 10,000 viral genome sequences of the SARS-CoV-2 virus have been made readily available during the ongoing coronavirus pandemic since the initial genome sequence of the virus was released on the open-access Virological website (http://virological.org/) on January 11. We use the published data on the single-stranded RNAs of 11,132 SARS-CoV-2 patients in the GISAID database, which contains fully or partially sequenced SARS-CoV-2 samples from laboratories around the world. Among the many important research questions currently being investigated, one pertains to the genetic characterization and classification of the virus. We analyze the nucleotide sequences of the virus and the geographic information for a subset of 7,640 SARS-CoV-2 patients without missing entries in the GISAID database. Rather than modeling the mutation rate, applying phylogenetic tree approaches, and so forth, we use a model-free clustering approach that compares the viruses at a genome-wide level. We apply principal component analysis to a similarity matrix that compares all pairs of these SARS-CoV-2 nucleotide sequences at all loci simultaneously, using the Jaccard index. Our analysis of the SARS-CoV-2 genome data illustrates the geographic and chronological progression of the virus, from the first cases observed in China to the current wave of cases in Europe and North America. This is in line with a phylogenetic analysis, which we use to contrast our results. We also observe that, based on their sequence data, the SARS-CoV-2 viruses cluster in distinct genetic subgroups. Whether the genetic subgroup is related to disease outcome, and what its potential implications for vaccine development are, is the subject of ongoing research.
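The clustering idea described in the abstract (a pairwise Jaccard similarity matrix followed by principal component analysis) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that each aligned sequence is encoded as a binary vector marking the loci where it differs from a reference sequence, that the Jaccard index of two sequences is the number of shared mutated loci divided by the number of loci mutated in either, and that the similarity matrix is double-centered before the eigendecomposition. The toy sequences and the function names `mutation_matrix`, `jaccard_similarity`, and `principal_components` are hypothetical.

```python
import numpy as np

def mutation_matrix(sequences, reference):
    """Binary matrix: entry (i, j) is 1 if sequence i differs from the reference at locus j.
    Assumes all sequences are aligned and have the same length as the reference."""
    ref = np.array(list(reference))
    return np.array([np.array(list(s)) != ref for s in sequences], dtype=int)

def jaccard_similarity(X):
    """Pairwise Jaccard index between the rows of a binary matrix X."""
    inter = X @ X.T                                   # loci mutated in both sequences
    counts = X.sum(axis=1)                            # loci mutated in each sequence
    union = counts[:, None] + counts[None, :] - inter
    with np.errstate(divide="ignore", invalid="ignore"):
        # Assumption: two sequences with no mutations at all are treated as identical.
        return np.where(union > 0, inter / union, 1.0)

def principal_components(S, k=2):
    """Top-k eigenvectors of the double-centered similarity matrix S."""
    n = S.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n               # centering matrix
    vals, vecs = np.linalg.eigh(H @ S @ H)            # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]
    return vecs[:, order]

# Hypothetical toy data: two sequences mutated near the start, two near the end.
reference = "ACGTACGTAC"
sequences = [
    "TGGTACGTAC",  # differs from the reference at loci 0, 1
    "TGGTACGTAA",  # differs at loci 0, 1, 9
    "ACGTACGCTC",  # differs at loci 7, 8
    "ACGTACACTC",  # differs at loci 6, 7, 8
]

X = mutation_matrix(sequences, reference)
S = jaccard_similarity(X)
pcs = principal_components(S, k=2)
print(np.round(S, 3))
print(np.round(pcs[:, 0], 3))
```

In this toy example the first principal component takes one sign for the first two sequences and the opposite sign for the last two, so plotting the leading components would separate the two groups. For thousands of genomes, a dense eigendecomposition like this one may need to be replaced by a sparse or truncated solver.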
Keywords: SARS-CoV-2; clustering; covid; jaccard.
© 2021 Wiley Periodicals LLC.
Conflict of interest statement
The authors declare no conflict of interest.
Update of
- Unsupervised cluster analysis of SARS-CoV-2 genomes reflects its geographic progression and identifies distinct genetic subgroups of SARS-CoV-2 virus. bioRxiv [Preprint]. 2020 Nov 20:2020.05.05.079061. doi: 10.1101/2020.05.05.079061. Update in: Genet Epidemiol. 2021 Apr;45(3):316-323. doi: 10.1002/gepi.22373. PMID: 32637949. Free PMC article.
Grants and funding
- U01 HL089897/HL/NHLBI NIH HHS/United States
- U01 HL089856/HL/NHLBI NIH HHS/United States
- P01 HL132825/HL/NHLBI NIH HHS/United States
- P01 HL120839/HL/NHLBI NIH HHS/United States
- R01 AI154470/AI/NIAID NIH HHS/United States
- R01 HG008976/HG/NHGRI NIH HHS/United States
- U01 HG008685/HG/NHGRI NIH HHS/United States
- 2U01HG008685/HL/NHLBI NIH HHS/United States
- National Institutes of Health: 1R01AI154470-01; 2U01HG008685
- Cure Alzheimer's Fund