Gigascience. 2022 Aug 11;11:giac079.
doi: 10.1093/gigascience/giac079.

The complexity landscape of viral genomes


Jorge Miguel Silva et al. Gigascience.

Abstract

Background: Viruses are among the shortest yet most abundant species, harboring minimal instructions to infect cells, adapt, multiply, and exist. However, despite the substantial number of viral genome sequences now available, the scientific literature lacks a complexity landscape that automatically illuminates the organization, relations, and fundamental characteristics of viral genomes.

Results: This work provides a comprehensive landscape of viral genome complexity (or quantity of information), identifying the most redundant and the most complex groups with respect to their genome sequences, and describing their distribution and characteristics at both large and local scales. Moreover, we identify and quantify the abundance of inverted repeats in viral genomes. For this purpose, we measure the sequence complexity of each available viral genome using data compression, demonstrating that adequate data compressors can efficiently quantify the complexity of viral genome sequences, including subsequences better represented by algorithmic sources (e.g., inverted repeats). Using a state-of-the-art genomic compressor on an extensive viral genomes database, we show that double-stranded DNA viruses are, on average, the most redundant viruses, while single-stranded DNA viruses are the least redundant. In contrast, double-stranded RNA viruses show lower redundancy than single-stranded RNA viruses. Furthermore, we extend the ability of data compressors to quantify local complexity (or information content) in viral genomes using complexity profiles, providing, for the first time, a direct complexity analysis of human herpesviruses. We also devise a feature-based classification methodology that can accurately distinguish viral genomes at different taxonomic levels without direct comparisons between sequences. This methodology combines data compression with simple measures such as GC-content percentage and sequence length, followed by machine learning classifiers.
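The three per-genome features named above (normalized compression, GC-content percentage, and sequence length) can be illustrated with a minimal sketch. It assumes Python's general-purpose `lzma` compressor as a stand-in for the genomic compressor used in the study (GeCo3), a 4-symbol DNA alphabet (so the normalizer is 2 bits per base), and hypothetical function names. Because `lzma` carries a fixed container overhead and is not tuned for DNA, the values are only indicative and can exceed 1 for short or random sequences.

```python
import lzma
import random


def normalized_compression(seq: str) -> float:
    """Compressed size in bits divided by len(seq) * log2(4).

    Assumption: lzma replaces the genomic compressor of the study, so the
    numbers are a rough upper bound on the complexity, not the paper's NC.
    """
    data = seq.upper().encode("ascii")
    compressed_bits = 8 * len(lzma.compress(data, preset=9))
    return compressed_bits / (2 * len(data))


def gc_content(seq: str) -> float:
    """Percentage of G and C bases in the sequence."""
    s = seq.upper()
    return 100.0 * (s.count("G") + s.count("C")) / len(s)


def genome_features(seq: str) -> tuple[float, float, int]:
    """Feature vector (NC, GC-content %, length) for one genome."""
    return normalized_compression(seq), gc_content(seq), len(seq)


if __name__ == "__main__":
    rng = random.Random(0)
    repetitive = "ACGT" * 5_000                           # highly redundant
    random_like = "".join(rng.choices("ACGT", k=20_000))  # little redundancy
    for name, seq in (("repetitive", repetitive), ("random-like", random_like)):
        nc, gc, length = genome_features(seq)
        print(f"{name}: NC={nc:.3f} GC={gc:.1f}% length={length}")
```

Feature vectors of this kind could then be passed to any standard classifier; the choice of classifier is not specified in this abstract, so none is shown here.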

Conclusions: This article presents methodologies and findings that are highly relevant for understanding the patterns of similarity and singularity between viral groups, opening new frontiers for studying the organization of viral genomes while depicting the complexity trends and classification components of these genomes at different taxonomic levels. The whole study is supported by an extensive website (https://asilab.github.io/canvas/) for exploring the viral genome characterization using dynamic and interactive approaches.

Keywords: algorithmic information theory; cladograms; data compression; genomics; sequence analysis; viral classification; viruses.


Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1: Selection of a level for GeCo3 from a pool of 19 levels. (A) Frequency with which each level provided the best NC results. (B) The sum, for each level, of the NC from the compression of all reference genomes. For better visualization, please visit the website https://asilab.github.io/canvas/.

Figure 2: Plot describing the variation of normalized compression (NC) and normalized block decomposition method (NBDM) with an increase of the mutation rate of a sequence (0–10%). The NC was computed using the state-of-the-art genomic compressor (GeCo3 [84]) and a general-purpose compressor (PAQ8 [104]). The NBDM (red line), the NC value using cmix (brown line), and PAQ8 (purple line) are depicted. Furthermore, the GeCo3 compressor with (IR2) and without (IR0) the IR detection subprogram is shown with orange and blue lines, respectively. Finally, the green line shows the difference between [formula image].

Figure 3: Average normalized compression (ANC) and average sequence length per viral group. The values were obtained for genome type (A) and realm (B). To view all boxplots by groups of realm, kingdom, phylum, class, order, family, and genus, please visit the website https://asilab.github.io/canvas/.

Figure 4: Average normalized compression (ANC) and average sequence length per genus of the Herpesviridae family (A) and for various human herpesviruses (B). In the boxplot where the genera of the Herpesviridae family are displayed, 2 genera were selected, one with a low level of inverted repeats (Lymphocryptovirus) and one with a high level (Mardivirus). Then, a representative reference sequence was selected for each (Lymphocryptovirus: human herpesvirus 4 or Epstein–Barr virus, NCBI Reference Sequence: [formula image]; Mardivirus: Falconid herpesvirus 1 strain S-18, NCBI Reference Sequence: [formula image]) and minimal bidirectional complexity profiles were created (C).

Figure 5: Cladograms showing average normalized compression (NC) of each viral group (A) and the normalized compression capacity (NCC) (B). NCC results were obtained by [formula image]. The red color depicts the highest complexity and the blue the lowest. The first cladogram describes the NC of each taxonomic branch. Red shows genomes with less redundancy and blue those with more redundancy. The second cladogram, on the other hand, depicts the prevalence of inverted repeats on each taxonomic branch. Red indicates branches with genomes with a high percentage of inverted repeats, whereas blue shows branches with a low percentage. For better visualization, please visit the website https://asilab.github.io/canvas/.

Figure 6: Scatterplots of normalized compression versus sequence length and GC-content (A), scatterplots of normalized compression versus sequence length (B), and normalized compression versus GC-content (C).
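The captions of Figures 2 and 5 contrast compression with and without inverted-repeat (IR) detection, and Figure 2 tracks NC as the mutation rate grows. The following is a minimal sketch of both effects, not a reproduction of the study's experiment. It assumes `lzma` as a stand-in general-purpose compressor with no IR awareness, a synthetic sequence built from a random 10,000-base unit followed by a (possibly mutated) copy, and hypothetical helper names. A compressor without IR modeling is expected to exploit the direct repeat but not the reverse-complemented one, which is the gap an IR-aware mode is meant to close.

```python
import lzma
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def normalized_compression(seq: str) -> float:
    """Compressed bits over 2 bits per base (lzma is an assumed stand-in)."""
    data = seq.encode("ascii")
    return 8 * len(lzma.compress(data, preset=9)) / (2 * len(data))


def reverse_complement(seq: str) -> str:
    """Inverted repeat of a DNA string: complement each base, then reverse."""
    return seq.translate(COMPLEMENT)[::-1]


def mutate(seq: str, rate: float, rng: random.Random) -> str:
    """Substitute each base by a different base with probability `rate`."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(42)
    unit = "".join(rng.choices("ACGT", k=10_000))

    # Direct repeat versus inverted repeat of the same unit.
    print("direct repeat   NC:", round(normalized_compression(unit + unit), 3))
    print("inverted repeat NC:",
          round(normalized_compression(unit + reverse_complement(unit)), 3))

    # NC of unit + mutated copy as the substitution rate grows from 0 to 10%.
    for rate in (0.0, 0.02, 0.05, 0.10):
        seq = unit + mutate(unit, rate, rng)
        print(f"mutation rate {rate:.2f}: NC={normalized_compression(seq):.3f}")
```

Only substitutions are modeled here; insertions and deletions, and the NBDM measure shown in Figure 2, are outside the scope of this sketch.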


