Gigascience. 2022 Aug 11;11:giac079.

doi: 10.1093/gigascience/giac079.

The complexity landscape of viral genomes

Jorge Miguel Silva¹, Diogo Pratas¹,²,³, Tânia Caetano⁴, Sérgio Matos¹,²

Affiliations

¹ Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal.
² Department of Electronics Telecommunications and Informatics, University of Aveiro, Campus Universitario de Santiago, 3810-193 Aveiro, Portugal.
³ Department of Virology, University of Helsinki, Haartmaninkatu 3, 00014 Helsinki, Finland.
⁴ Department of Biology, University of Aveiro, Campus Universitario de Santiago, 3810-193 Aveiro, Portugal.

PMID: 35950839
PMCID: PMC9366995
DOI: 10.1093/gigascience/giac079

Jorge Miguel Silva et al. Gigascience. 2022.

Abstract

Background: Viruses are among the shortest yet highly abundant species, harboring minimal instructions to infect cells, adapt, multiply, and exist. However, despite the current substantial availability of viral genome sequences, the scientific repertoire lacks a complexity landscape that automatically sheds light on the organization, relations, and fundamental characteristics of viral genomes.

Results: This work provides a comprehensive landscape of viral genome complexity (or quantity of information), identifying the most redundant and the most complex groups with respect to their genome sequence while providing their distribution and characteristics at both large and local scales. Moreover, we identify and quantify the abundance of inverted repeats in viral genomes. For this purpose, we measure the sequence complexity of each available viral genome using data compression, demonstrating that adequate data compressors can efficiently quantify the complexity of viral genome sequences, including subsequences better represented by algorithmic sources (e.g., inverted repeats). Using a state-of-the-art genomic compressor on an extensive database of viral genomes, we show that double-stranded DNA viruses are, on average, the most redundant viruses, while single-stranded DNA viruses are the least redundant. In contrast, double-stranded RNA viruses show lower redundancy than single-stranded RNA viruses. Furthermore, we extend the ability of data compressors to quantify local complexity (or information content) in viral genomes using complexity profiles, providing an unprecedented direct complexity analysis of human herpesviruses. We also conceive a features-based classification methodology that can accurately distinguish viral genomes at different taxonomic levels without direct comparisons between sequences. This methodology combines data compression with simple measures such as GC-content percentage and sequence length, followed by machine learning classifiers.
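The three features named in the Results (a compression-based complexity measure, GC-content, and sequence length) can be illustrated with a minimal sketch. It assumes the common definition NC(x) = C(x) / (|x| log2 |A|), where C(x) is the compressed size in bits and |A| = 4 for nucleotides, and it uses Python's general-purpose `lzma` module as a stand-in compressor. This is not GeCo3, and the function names are hypothetical, so absolute values will differ from those reported in the article.

```python
import lzma
import math


def normalized_compression(seq: str, alphabet_size: int = 4) -> float:
    """Approximate NC(x) = C(x) / (|x| * log2(|A|)).

    Assumption: lzma is only a stand-in for a specialized genomic
    compressor; its container overhead inflates NC for short inputs.
    """
    if not seq:
        raise ValueError("sequence must not be empty")
    compressed_bits = 8 * len(lzma.compress(seq.encode("ascii"), preset=9))
    return compressed_bits / (len(seq) * math.log2(alphabet_size))


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C."""
    if not seq:
        raise ValueError("sequence must not be empty")
    return sum(base in "GC" for base in seq) / len(seq)


def features(seq: str) -> tuple[float, float, int]:
    """Hypothetical feature vector: (NC, GC-content, sequence length)."""
    seq = seq.upper()
    return (normalized_compression(seq), gc_content(seq), len(seq))


if __name__ == "__main__":
    # A highly repetitive toy sequence should yield a low NC.
    print(features("ACGT" * 5000))
```

Feature vectors of this kind could then be passed to any standard classifier; the article's actual classifiers and settings are not reproduced here.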

Conclusions: This article presents methodologies and findings that are highly relevant for understanding the patterns of similarity and singularity between viral groups, opening new frontiers for studying viral genomes' organization while depicting the complexity trends and classification components of these genomes at different taxonomic levels. The whole study is supported by an extensive website (https://asilab.github.io/canvas/) for comprehending the viral genome characterization using dynamic and interactive approaches.

Keywords: algorithmic information theory; cladograms; data compression; genomics; sequence analysis; viral classification; viruses.

Conflict of interest statement

The authors declare no competing interests.

Figures

**Figure 1**
: Selection of a level for GeCo3 from a pool of 19 levels. (A) Frequency with which each level provided the best NC results. (B) Sum, for each level, of the NC values obtained when compressing all reference genomes. For better visualization, please visit the website https://asilab.github.io/canvas/.

**Figure 2**
: Plot describing the variation of normalized compression (NC) and normalized block decomposition method (NBDM) with an increase of mutation rate of a sequence (0–10%). The NC was computed using the state-of-the-art genomic compressor (GeCo3 [84]) and a general-purpose compressor (PAQ8 [104]). The NBDM (red line), the NC value using cmix (brown line), and PAQ8 (purple line) are depicted. Furthermore, the GeCo3 compressor with (IR₂) and without (IR₀) the IR detection subprogram is shown with orange and blue lines, respectively. Finally, the green line shows the difference between .
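The kind of experiment summarized in Figure 2 can be approximated with a short sketch: substitutions are applied to a redundant synthetic sequence at rates from 0 to 10%, and the normalized compression is recomputed at each rate. Everything here is an assumption made for illustration: the synthetic sequence, the substitution-only mutation model, the helper names, and the use of `lzma` in place of GeCo3, PAQ8, or cmix. NBDM is not covered.

```python
import lzma
import math
import random


def normalized_compression(seq: str, alphabet_size: int = 4) -> float:
    """NC estimate with lzma as a stand-in compressor (assumption)."""
    bits = 8 * len(lzma.compress(seq.encode("ascii"), preset=9))
    return bits / (len(seq) * math.log2(alphabet_size))


def mutate(seq: str, rate: float, rng: random.Random) -> str:
    """Substitute each base with a different base with probability `rate`."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def nc_versus_mutation(rates=(0.0, 0.02, 0.04, 0.06, 0.08, 0.10), seed=7):
    """Return (rate, NC) pairs for a repeated 1000-base unit under mutation."""
    rng = random.Random(seed)
    unit = "".join(rng.choice("ACGT") for _ in range(1000))
    original = unit * 20  # redundant by construction
    return [(r, normalized_compression(mutate(original, r, rng))) for r in rates]


if __name__ == "__main__":
    for rate, nc in nc_versus_mutation():
        print(f"mutation rate {rate:.2f}: NC = {nc:.3f}")
```

As substitutions break up the exact repeats, the compressor finds fewer long matches and the NC estimate rises.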

**Figure 3**
: Average normalized compression (ANC) and average sequence length per viral group. The values were obtained for genome type (A) and realm (B). To view all boxplots by groups of realm, kingdom, phylum, class, order, family, and genus, please visit the website https://asilab.github.io/canvas/.

**Figure 4**
: Average normalized compression (ANC) and average sequence length per the genera of the Herpesviridae family (A) and for various human herpesviruses (B). In the boxplot where the genera of the Herpesviridae family are displayed, 2 genera were selected, one with a low level of inverted repeats (*Lymphocryptovirus*) and one with a high level (*Mardivirus*). Then, a representative reference sequence was selected (*Lymphocryptovirus*—human herpesvirus 4 or Epstein–Barr virus, NCBI Reference Sequence: ; *Mardivirus*—Falconid herpesvirus 1 strain S-18, NCBI Reference Sequence: ) and minimal bidirectional complexity profiles were created (C).
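A local complexity profile of the kind mentioned for panel C can be sketched with a small adaptive finite-context model. The sketch assumes that a "minimal bidirectional" profile takes, at each position, the minimum of the per-base information estimated in a forward pass and in a backward pass. The order-3 context, the smoothing parameter, and the function names are illustrative choices, not the article's configuration.

```python
import math
from collections import defaultdict


def profile(seq: str, k: int = 3, alpha: float = 1.0) -> list[float]:
    """Per-base information (bits) under an adaptive order-k context model.

    Probabilities are estimated from counts seen so far, with additive
    smoothing `alpha` over an assumed 4-symbol alphabet.
    """
    counts = defaultdict(lambda: defaultdict(int))
    bits = []
    for i, base in enumerate(seq):
        ctx_counts = counts[seq[max(0, i - k):i]]
        total = sum(ctx_counts.values())
        p = (ctx_counts[base] + alpha) / (total + 4 * alpha)
        bits.append(-math.log2(p))
        ctx_counts[base] += 1  # update the model after coding the base
    return bits


def minimal_bidirectional_profile(seq: str, k: int = 3) -> list[float]:
    """Position-wise minimum of the forward and backward profiles."""
    forward = profile(seq, k)
    backward = profile(seq[::-1], k)[::-1]
    return [min(f, b) for f, b in zip(forward, backward)]


if __name__ == "__main__":
    toy = "ACGT" * 50 + "GATTACAGGCTTACCGATAG"
    values = minimal_bidirectional_profile(toy)
    print(f"mean bits per base: {sum(values) / len(values):.3f}")
```

Low values mark regions that the model predicts well (repetitive content), while values near 2 bits per base mark regions that look random to it.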

**Figure 5**
: Cladograms showing average normalized compression (NC) of each viral group (A) and the normalized compression capacity (NCC) (B). NCC results were obtained by . The red color depicts the highest complexity and the blue the lowest. The first cladogram describes the NC of each taxonomic branch. Red color shows genomes with less redundancy and blue ones with more redundancy. On the other hand, the second cladogram depicts the prevalence of inverted repeats on each taxonomic branch. Red indicates branches with genomes with a high percentage of inverted repeats, whereas blue shows branches with a low percentage. For better visualization, please visit the website https://asilab.github.io/canvas/.

**Figure 6**
: Scatterplots of normalized compression versus sequence length and GC-content (A), scatterplots of normalized compression versus sequence length (B), and normalized compression versus GC-content (C).

Cited by

Herpesviruses: overview of systematics, genomic complexity and life cycle.
Dotto-Maurel A, Arzul I, Morga B, Chevignon G. Virol J. 2025 May 22;22(1):155. doi: 10.1186/s12985-025-02779-7. PMID: 40399963. Review.
Hecatomb: an integrated software platform for viral metagenomics.
Roach MJ, Beecroft SJ, Mihindukulasuriya KA, Wang L, Paredes A, Cárdenas LAC, Henry-Cocks K, Lima LFO, Dinsdale EA, Edwards RA, Handley SA. Gigascience. 2024 Jan 2;13:giae020. doi: 10.1093/gigascience/giae020. PMID: 38832467.
Genomic Insights into Neglected Orthobunyaviruses: Molecular Characterization and Phylogenetic Analysis.
Sankhe S, Dieng I, Kane M, Diallo A, Ndiaye NA, Top NM, Dia M, Faye O, Sall AA, Faye O, Sembene PM, Loucoubar C, Faye M, Diagne MM. Viruses. 2025 Mar 13;17(3):406. doi: 10.3390/v17030406. PMID: 40143333.
Temperature modulates dominance of a superinfecting Arctic virus in its unicellular algal host.
Meyer C, Jackson VLN, Harrison K, Fouskari I, Bolhuis H, Artzy-Randrup YA, Huisman J, Monier A, Brussaard CPD. ISME J. 2024 Jan 8;18(1):wrae161. doi: 10.1093/ismejo/wrae161. PMID: 39173010.

References

1. Hendrix RW, Hatfull GF, Ford ME, et al. Evolutionary relationships among diverse bacteriophages and prophages: all the world’s a phage. In: Syvanen M, Kado CI, eds. Horizontal gene transfer. New York: Elsevier; 2002. p. 133–VI.
2. O’Leary NA, Wright MW, Brister JR, et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016;44(D1):D733–45.
3. Edwards RA, Rohwer F. Viral metagenomics. Nat Rev Microbiol. 2005;3(6):504–10.
4. Lawrence CM, Menon S, Eilers BJ, et al. Structural and functional studies of archaeal viruses. J Biol Chem. 2009;284(19):12599–603.
5. Koonin EV, Senkevich TG, Dolja VV. The ancient Virus World and evolution of cells. Biol Direct. 2006;1(1):29.
