Virus Evol. 2025 Apr 26;11(1):veaf029.
doi: 10.1093/ve/veaf029. eCollection 2025.

Similarity-weighted entropy for quantifying genetic diversity in viral quasispecies

Jian Wu. Virus Evol.

Abstract

A viral quasispecies is a genetically diverse population of closely related viral variants that exist in a state of dynamic equilibrium. This diversity, driven by mutations, recombination, and selective pressures, enables viruses to adapt rapidly, affecting pathogenicity and treatment resistance. Quantifying the genetic diversity within viral quasispecies is therefore crucial for understanding viral evolution and for designing effective therapeutic strategies. Entropy is a commonly used metric to measure genetic diversity within such populations; however, traditional entropy calculations often neglect genetic similarities between sequences, which can result in overestimating true diversity. In this study, I compare several widely used diversity indices for quantifying viral quasispecies diversity and introduce a novel similarity-weighted entropy metric that incorporates sequence similarity into entropy calculations. This approach enables a more comprehensive representation of diversity in genetically cohesive viral populations. By applying both conventional and similarity-weighted entropy calculations to hypothetical sequence populations and real viroid and virus quasispecies, I demonstrate that similarity-weighted entropy provides a more comprehensive measure of genetic diversity while maintaining the simplicity of conventional entropy. These findings highlight the value of similarity-weighted entropy in characterizing viral quasispecies and its potential to improve our understanding of viral adaptation and resistance mechanisms.

Keywords: Viral quasispecies; entropy; genetic diversity; sequence similarity.

Conflict of interest statement

None declared.

Figures

Figure 1.
Rationale for including sequence similarity as a weight in entropy calculations to reflect genetic diversity in viral quasispecies. (a) The curve for h_i = −f_i × log2(f_i) illustrates how entropy is influenced by the frequency f_i of each unique sequence i in a population. The entropy H is the sum of the individual entropies h_i. The curve peaks at f_i = 1/e ≈ 0.3679 and falls towards zero as f_i approaches either 0 or 1. The plot helps visualize the balance between diversity (nonzero frequencies spread across many unique sequences) and uniformity (high frequencies for a few sequences, which reduce entropy). (b) Analysing sequence similarity in two hypothetical viral quasispecies reveals the limitations of traditional entropy in capturing genetic diversity. Standard entropy metrics consider only the distribution of variant frequencies and disregard genetic relationships between sequences. In both hypothetical quasispecies 1 and 2, a master sequence accounts for 90% of the population, with four variants making up 4%, 4%, 1%, and 1%, respectively. This distribution yields identical entropy-based diversity values for both quasispecies. However, the variants in quasispecies 2 contain more mutations, indicated by the red solid triangles, so its actual genetic diversity is higher than that of quasispecies 1. (c) The principle of incorporating sequence similarity as a weight in the entropy calculation, illustrated with a master sequence represented by the largest red solid circle in the centre. The diagram shows the relationship between sequence diversity, genetic distance (sequence similarity), and weighting in the calculation. Quasispecies 2 from (b) is used as an example. Two dashed circles indicate levels of genetic distance: the larger dashed circle represents greater distance, while the smaller one indicates closer similarity. The size of each red circle corresponds to sequence frequency. The thickness of the lines connecting the master sequence to its variants represents the weights added to entropy, with greater genetic distance and higher variant frequency contributing more to genetic diversity.
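The arithmetic behind panels (a) and (b) can be checked with a minimal sketch. The function names `h` and `shannon_entropy` are illustrative choices, not names from the paper; the frequencies are the 90%, 4%, 4%, 1%, 1% profile given in the caption.

```python
import math

def h(f: float) -> float:
    """Per-sequence term h_i = -f_i * log2(f_i); taken as 0 when f_i is 0."""
    return 0.0 if f == 0 else -f * math.log2(f)

def shannon_entropy(freqs):
    """Conventional entropy H: the sum of the per-sequence terms."""
    return sum(h(f) for f in freqs)

# Frequency profile shared by both hypothetical quasispecies in panel (b).
freqs = [0.90, 0.04, 0.04, 0.01, 0.01]
H = shannon_entropy(freqs)

# Height of the curve in panel (a) at its maximum, f = 1/e.
peak = h(1 / math.e)

print(f"H = {H:.4f} bits")
print(f"h(1/e) = {peak:.4f}")
```

Because `shannon_entropy` sees only the frequencies, it returns the same value for quasispecies 1 and 2 regardless of how many mutations separate the variants from the master sequence, which is the limitation the caption describes.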
Figure 2.
Limitations of conventional diversity indices in analysing viral quasispecies with similar variant frequencies but different sequence similarities. Four entropy-based metrics (H, Hn, Hsim, and Hnsim) were used to evaluate their effectiveness in describing viral quasispecies diversity across varying sequence similarity levels. Viral quasispecies typically consist of a dominant master sequence with medium- to low-frequency variants that share high sequence similarity, with values typically above 80%, as lower similarity could indicate the emergence of distinct viral species. To simulate realistic quasispecies populations, three distributions of 10 sequences with different frequency profiles were designed: (a) Distribution 1, which includes a high-frequency master sequence, a few medium-frequency variants, and several low-frequency variants, reflecting a common structure in viral quasispecies; (b) Distribution 2, similar to Distribution 1, but with the master sequence at a slightly lower frequency and the remaining frequencies more evenly distributed among the medium- and low-frequency variants, mimicking viral populations with a more even variant diversity; and (c) Distribution 3, a uniform distribution in which all variants have equal frequency, representing a maximally diverse population without a dominant master sequence. For each distribution, H, Hn, Hsim, and Hnsim were calculated across a range of sequence similarity values (from 0.999 to 0) on the x-axis, with sequence frequencies (S1–S10) also presented. Dashed-line rectangles indicate the sequence similarity values typically observed within viral quasispecies.
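The formula for Hsim is not given in this excerpt, so the sketch below does not reproduce it. Instead it swaps in a different, established similarity-sensitive entropy (the order-1 form associated with Leinster and Cobbold in ecology) purely to illustrate how an entropy can respond to a sweep of pairwise similarity values while the frequencies stay fixed. The function names and the uniform off-diagonal similarity matrix are assumptions made for this illustration; its numbers should not be read as the curves in the figure.

```python
import math

def similarity_sensitive_entropy(freqs, sim):
    """Order-1 similarity-sensitive entropy: -sum_i p_i * log2(sum_j Z_ij * p_j).

    Not the paper's Hsim; a stand-in used only for illustration.
    """
    total = 0.0
    for i, p_i in enumerate(freqs):
        if p_i == 0:
            continue
        # Frequency-weighted similarity of sequence i to the whole population.
        ordinariness = sum(sim[i][j] * p_j for j, p_j in enumerate(freqs))
        total -= p_i * math.log2(ordinariness)
    return total

def uniform_similarity(n, s):
    """Similarity matrix with 1 on the diagonal and s everywhere else (assumed)."""
    return [[1.0 if i == j else s for j in range(n)] for i in range(n)]

# Ten equally frequent sequences, as in Distribution 3.
freqs = [0.1] * 10
for s in (0.0, 0.8, 0.95, 0.999):
    value = similarity_sensitive_entropy(freqs, uniform_similarity(10, s))
    print(f"similarity {s}: {value:.4f} bits")
```

With similarity 0 between distinct sequences this stand-in reduces to the conventional entropy log2(10); as similarity approaches 1 it shrinks towards 0, because the population behaves more and more like a single sequence.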
Figure 3.
Entropy metrics across varying sequence distributions at different fixed sequence similarities. Four entropy calculation methods (H, Hn, Hsim, and Hnsim) were tested across nine distributions from D-1 to D-9 (a) at fixed sequence similarities of 0.95 (b), 0.90 (c), and 0.80 (d). On the x-axis, the sequence distributions ranged from D-1, where a single dominant sequence (master) prevails in the population, to D-9, where all sequences have more balanced and higher frequencies, creating a more diverse population.
Figure 4.
Analysis of four entropy metrics (H, Hn, Hsim, and Hnsim) using sequence populations with saturated mutations and uniform sequence frequencies. Sequence populations were generated with fully saturated mutations for N (4 sequences), NN (16 sequences), NNN (64 sequences), and so on up to 7 Ns (16,384 sequences), all with equal sequence frequencies, forming uniform distributions. For clarity, H and Hn are presented in panel (a), while Hsim and Hnsim are displayed in panel (b). The x-axis shows the sequence length in each population.
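The saturated populations described here are straightforward to enumerate, and for equal frequencies the conventional entropy has a closed form. The sketch below is a minimal illustration: the helper names are hypothetical, and normalising Hn by log2 of the number of unique sequences is an assumption, since the excerpt does not state which normaliser the paper uses.

```python
import itertools
import math

def saturated_population(length, alphabet="ACGU"):
    """Every sequence of the given length over the four nucleotides."""
    return ["".join(p) for p in itertools.product(alphabet, repeat=length)]

def entropy_uniform(n):
    """Conventional entropy H for n equally frequent sequences (equals log2(n))."""
    f = 1 / n
    return -sum(f * math.log2(f) for _ in range(n))

for length in range(1, 8):
    population = saturated_population(length)
    n = len(population)
    H = entropy_uniform(n)
    # Assumption: Hn divides H by log2(number of unique sequences).
    Hn = H / math.log2(n)
    print(f"{length} N: {n} sequences, H = {H:.1f} bits, Hn = {Hn:.1f}")
```

Under that assumption H grows by 2 bits for every added N while Hn stays at 1, so neither conventional index reflects how the sequences in each population relate to one another.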
Figure 5.
Analysis of a real viroid quasispecies demonstrated the advantage of similarity-weighted entropy indices in capturing functional diversity. (a) Schematic representation of the mutagenesis strategy applied to the loop 27 region of potato spindle tuber viroid (PSTVd), spanning nucleotides 177–182. The wild-type sequence, UUUUCA, forms a hairpin structure with a four-nucleotide loop (UUUC) and a UA base pair that closes the loop. The wild-type sequence was mutated to UNNNNA, where each N represents one of four nucleotides (A, U, G, or C), generating 256 unique sequence variants. (b) A pairwise similarity matrix for the 256 variants was generated and summarized, showing that similarity values of 0.75, 0.50, 0.25, and 0.00 account for 4.71%, 21.18%, 42.35%, and 31.76% of all pairs, respectively. (c) Comparison of entropy metrics (H, Hn, Hsim, and Hnsim) across the theoretical mutant pool (with equal frequencies for all 256 sequences), the practical pool generated by saturated mutagenesis, and the evolved pool after replication in the inoculated region, migration to leaf margins (LM), and trafficking to systemic leaves (Sys samples) within the plant. (d) Structural compatibility of each mutant sequence relative to the wild-type PSTVd loop 27, grouped by mismatch count (M1–M4). The JAR3D tool (https://rna.bgsu.edu/jar3d/) was used to align the mutant sequences with the 3D structural model of the wild-type PSTVd loop 27 (UUUUCA). The resulting Cutoff Score reflects the compatibility between each mutant sequence and the wild-type model. Higher Cutoff Scores indicate stronger structural similarity to the wild type, and the score decreases as the number of mismatches increases. The Cutoff Scores are presented as the mean ± standard error. (e) Structural comparison of selected mutants (UGGAAA and UGAAAA) and the wild-type sequence (UUUUCA). JAR3D was used to predict the structural models for both wild-type and mutant sequences. Models with identical sequences were identified. Mutants UGGAAA and UGAAAA form distinct structures from the wild type, potentially indicating functional diversification.
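The pairwise similarity summary in panel (b) can be re-derived by brute force. The sketch below rests on two assumptions that the caption does not spell out: similarity is the fraction of the four variable positions that match, and each unordered pair of distinct variants is counted once. The function names are illustrative.

```python
import itertools
from collections import Counter

def loop_variants():
    """All 256 UNNNNA variants: fixed U and A flanking four variable positions."""
    return ["U" + "".join(p) + "A" for p in itertools.product("ACGU", repeat=4)]

def similarity(a, b):
    """Fraction of the four variable positions (string indices 1 to 4) that match."""
    matches = sum(x == y for x, y in zip(a[1:5], b[1:5]))
    return matches / 4

variants = loop_variants()

# Tally similarity over every unordered pair of distinct variants.
counts = Counter(similarity(a, b) for a, b in itertools.combinations(variants, 2))
total = sum(counts.values())

for s in sorted(counts, reverse=True):
    print(f"similarity {s:.2f}: {100 * counts[s] / total:.2f}% of pairs")
```

Under these assumptions the 0.75 and 0.50 classes come out at 4.71% and 21.18% of pairs, matching the caption, and all remaining pairs fall into the 0.25 and 0.00 classes.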
Figure 6.
Sequence diversity of ToBRFV in tm-2 and Tm-2² plants revealed by four entropy metrics. (a) Genome organization of ToBRFV. ToBRFV encodes two key proteins for RNA replication: a 126 kDa protein and a 183 kDa protein, the latter produced via ribosomal readthrough of the 126 kDa protein’s termination codon. ToBRFV also encodes a movement protein (MP) and a coat protein (CP). The 5′ UTR acts as a translational enhancer, while the 3′ UTR increases mRNA stability. This study focuses on 73 nucleotides (nts) of the 5′ UTR and the first 47 nts of the gene encoding the 126 kDa protein, a region totalling 120 nts, which I named 5′-120nts. (b) Variants of the 5′-120nts region in ToBRFV-infected tm-2 and Tm-2² plants were sequenced using single-cell RNA sequencing performed with the 10× Genomics Chromium system. Unique sequences were identified, and the number of reads for each sequence was counted. The abundance (percentage) of each unique sequence was then calculated. To exclude errors introduced during library preparation and sequencing, sequences with an abundance below 0.1% of the total reads were excluded. As a result, 48 and 47 unique sequences were identified for the tm-2 and Tm-2² samples, respectively. The abundance of each unique sequence in the two samples is presented. (c) Each unique variant was compared to the master sequence (Seq1) and classified by its number of mutations as a single, double, triple, quadruple, quintuple, or sextuple mutant. (d) H, Hn, Hsim, and Hnsim were calculated for the quasispecies of the 5′-120nts region identified from the tm-2 and Tm-2² samples. These values were then compared between the two samples.
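The processing steps in panels (b) and (c) (count unique sequences, convert to abundances, drop anything below 0.1% of reads, then classify variants by mutations relative to the master) can be sketched as follows. This is not the study's pipeline: the function names are hypothetical, the reads are toy data invented for illustration, mutations are counted as substitutions between equal-length sequences only, and abundances are not renormalised after filtering.

```python
from collections import Counter

def abundance_table(reads, min_fraction=0.001):
    """Map each unique sequence to its share of all reads, dropping rare ones.

    Sequences below min_fraction (0.1% by default) are excluded.
    """
    counts = Counter(reads)
    total = sum(counts.values())
    return {
        seq: n / total
        for seq, n in counts.most_common()
        if n / total >= min_fraction
    }

def mutation_count(seq, master):
    """Number of substituted positions relative to the master sequence."""
    if len(seq) != len(master):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(seq, master))

# Toy reads, invented for illustration only (not data from the study).
reads = ["ACGUAC"] * 1800 + ["ACGUAU"] * 120 + ["ACCUAU"] * 79 + ["GGGUAU"] * 1

table = abundance_table(reads)
master = max(table, key=table.get)  # most abundant sequence plays the role of Seq1
classes = Counter(mutation_count(seq, master) for seq in table if seq != master)

print(table)
print(dict(classes))  # keys: number of mutations; values: number of unique variants
```

In the toy data the single `GGGUAU` read makes up 0.05% of reads and is filtered out, leaving one single mutant and one double mutant alongside the master sequence.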


LinkOut - more resources