From Alpha to Zeta: Identifying Variants and Subtypes of SARS-CoV-2 Via Clustering
- PMID: 34698508
- PMCID: PMC8819513
- DOI: 10.1089/cmb.2021.0302
Abstract
The availability of millions of SARS-CoV-2 (Severe Acute Respiratory Syndrome-Coronavirus-2) sequences in public databases such as GISAID (Global Initiative on Sharing All Influenza Data) and EMBL-EBI (European Molecular Biology Laboratory-European Bioinformatics Institute, United Kingdom) allows the evolution, genomic diversity, and dynamics of a virus to be studied in unprecedented detail. Here, we identify novel variants and subtypes of SARS-CoV-2 by clustering sequences, adapting methods originally designed for haplotyping intrahost viral populations. We assess our results using clustering entropy, the first time it has been used in this context. Our clustering approach reaches lower entropies than other methods, and we are able to lower them even further through gap filling and Monte Carlo-based entropy minimization. Moreover, our method clearly identifies the well-known Alpha variant in the U.K. and GISAID data sets, and it also detects the much less represented (<1% of the sequences) Beta (South Africa), Epsilon (California), and Gamma and Zeta (Brazil) variants in the GISAID data set. Finally, we show that each variant identified has high selective fitness, based on the growth rate of its cluster over time. This demonstrates that our clustering approach is a viable alternative for detecting even rare subtypes in very large data sets.
Keywords: clustering; entropy; fitness; genomic surveillance; viral subtypes; viral variants.
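The abstract does not state the formula for clustering entropy, so the following is only a minimal sketch under an assumed definition: for each cluster, sum the Shannon entropy of the nucleotide distribution at every aligned position, then average over clusters weighted by cluster size. The function names (`column_entropy`, `cluster_entropy`, `clustering_entropy`) and the toy sequences are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the symbols at one aligned position."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def cluster_entropy(seqs):
    """Sum of per-position entropies over one cluster of equal-length sequences."""
    return sum(column_entropy(col) for col in zip(*seqs))


def clustering_entropy(clusters):
    """Size-weighted average of cluster entropies (assumed definition)."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) * cluster_entropy(c) for c in clusters) / total


# Toy aligned sequences (illustrative only, not real SARS-CoV-2 data).
pure = [["AAAA", "AAAA"], ["CCGT", "CCGT"]]    # each cluster is homogeneous
mixed = [["AAAA", "CCGT"], ["AAAA", "CCGT"]]   # each cluster mixes two types
print(clustering_entropy(pure), clustering_entropy(mixed))
```

Under this assumed definition, a clustering that groups identical sequences together scores lower than one that mixes distinct sequences in the same cluster, which matches the abstract's use of lower entropy as the sign of a better clustering.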
Conflict of interest statement
The authors declare they have no conflicting financial interests.