J Comput Biol. 2021 Nov;28(11):1113-1129.
doi: 10.1089/cmb.2021.0302. Epub 2021 Oct 25.

From Alpha to Zeta: Identifying Variants and Subtypes of SARS-CoV-2 Via Clustering

Andrew Melnyk et al. J Comput Biol. 2021 Nov.

Abstract

The availability of millions of SARS-CoV-2 (Severe Acute Respiratory Syndrome-Coronavirus-2) sequences in public databases such as GISAID (Global Initiative on Sharing All Influenza Data) and EMBL-EBI (European Molecular Biology Laboratory-European Bioinformatics Institute; the United Kingdom) allows the evolution, genomic diversity, and dynamics of a virus to be studied in unprecedented detail. Here, we identify novel variants and subtypes of SARS-CoV-2 by clustering sequences, adapting methods originally designed for haplotyping intrahost viral populations. We assess our results using clustering entropy, the first time it has been used in this context. Our clustering approach reaches lower entropies than other methods, and we are able to reduce them even further through gap filling and Monte Carlo-based entropy minimization. Moreover, our method clearly identifies the well-known Alpha variant in the U.K. and GISAID data sets, and is also able to detect the much less represented (<1% of the sequences) Beta (South Africa), Epsilon (California), and Gamma and Zeta (Brazil) variants in the GISAID data set. Finally, we show that each variant identified has high selective fitness, based on the growth rate of its cluster over time. This demonstrates that our clustering approach is a viable alternative for detecting even rare subtypes in very large data sets.

Keywords: clustering; entropy; fitness; genomic surveillance; viral subtypes; viral variants.
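
To make the entropy criterion in the abstract concrete, here is a minimal sketch of one common way to score a clustering of aligned sequences. It assumes the clustering entropy is the size-weighted sum of per-column Shannon entropies within each cluster; the abstract does not give the formula, so this definition, the function names, and the toy sequences are illustrative assumptions rather than the authors' implementation.

```python
import math
from collections import Counter


def column_entropy(symbols):
    """Shannon entropy (in bits) of the symbols observed in one alignment column."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def clustering_entropy(sequences, labels):
    """Assumed definition: sum over clusters C of |C| * sum over columns of H(column in C).

    `sequences` are equal-length aligned strings; `labels[i]` is the cluster of sequence i.
    """
    clusters = {}
    for seq, label in zip(sequences, labels):
        clusters.setdefault(label, []).append(seq)
    total = 0.0
    for members in clusters.values():
        # zip(*members) iterates over alignment columns within this cluster.
        per_column = sum(column_entropy(col) for col in zip(*members))
        total += len(members) * per_column
    return total


# Toy data (hypothetical): two groups that differ in the first two positions.
seqs = ["ACGT", "ACGT", "ACGA", "TTGA", "TTGA", "TTGC"]
coherent = [0, 0, 0, 1, 1, 1]   # groups similar sequences together
scrambled = [0, 1, 0, 1, 0, 1]  # mixes the two groups

print(clustering_entropy(seqs, coherent) < clustering_entropy(seqs, scrambled))
```

Under this assumed definition, a clustering that groups near-identical sequences scores lower, which is the sense in which lower entropy indicates a better clustering.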

Conflict of interest statement

The authors declare they have no conflicting financial interests.

Figures

FIG. 1.
Subtype distribution (GISAID data set, 15-day window, relative count).
FIG. 2.
Subtype distribution (GISAID data set, cumulative, relative count).
FIG. 3.
Entropy descent of our Monte Carlo method, applied to the initial clustering obtained by CliqueSNV-based clustering of the GISAID 1A data set after preprocessing to (a) n=28,000 tags and (b) n=1000 tags. Note that in the latter case, the entropy is computed over just the 1000 tags; the optimal clustering in terms of these 1000 tags is then applied to the original set of all columns, giving the final entropy of 945.9 seen in Table 6. A threshold of θ=1000 (see Section 2.4) was used in both cases.
FIG. 4.
Subtype distribution (the U.K. data set, weekly window, relative count) produced by our CliqueSNV-based clustering method. The subtype in the bottom right corner contributes to sequences that correspond to the Alpha variant.
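
The entropy descent described in the caption of FIG. 3 can be illustrated with a simple stochastic search. The sketch below is an assumption-laden stand-in, not the authors' procedure: it proposes random single-sequence moves between clusters and keeps a move only if the entropy does not rise, using the same assumed size-weighted entropy objective. The threshold θ from Section 2.4 is not modeled, and all names and data are hypothetical.

```python
import math
import random
from collections import Counter


def clustering_entropy(sequences, labels):
    """Assumed objective: sum over clusters of |C| * (sum of per-column Shannon entropies)."""
    clusters = {}
    for seq, label in zip(sequences, labels):
        clusters.setdefault(label, []).append(seq)
    total = 0.0
    for members in clusters.values():
        n = len(members)
        for column in zip(*members):
            for count in Counter(column).values():
                total -= n * (count / n) * math.log2(count / n)
    return total


def monte_carlo_descent(sequences, labels, steps=2000, seed=0):
    """Propose random single-sequence moves; keep a move only if entropy does not rise."""
    rng = random.Random(seed)
    labels = list(labels)
    cluster_ids = sorted(set(labels))
    current = clustering_entropy(sequences, labels)
    trace = [current]  # entropy after each accepted move
    for _ in range(steps):
        i = rng.randrange(len(labels))
        new_label = rng.choice(cluster_ids)
        if new_label == labels[i]:
            continue
        old_label = labels[i]
        labels[i] = new_label
        candidate = clustering_entropy(sequences, labels)
        if candidate <= current:
            current = candidate
            trace.append(current)
        else:
            labels[i] = old_label  # revert the rejected move
    return labels, trace


# Toy data (hypothetical), starting from a deliberately poor clustering.
seqs = ["ACGT", "ACGT", "ACGA", "TTGA", "TTGA", "TTGC"]
final_labels, trace = monte_carlo_descent(seqs, [0, 1, 0, 1, 0, 1])
print(trace[0], trace[-1])
```

This version recomputes the full entropy at every step, which is fine for a toy example but would be far too slow for millions of sequences; a practical implementation would update per-cluster column counts incrementally.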
