Comparison of Compression-Based Measures with Application to the Evolution of Primate Genomes
- PMID: 33265483
- PMCID: PMC7512912
- DOI: 10.3390/e20060393
Abstract
An efficient DNA compressor furnishes an approximation to measure and compare information quantities present in, between and across DNA sequences, regardless of the characteristics of the sources. In this paper, we directly compare two information measures, the Normalized Compression Distance (NCD) and the Normalized Relative Compression (NRC). These measures answer different questions: the NCD measures how similar two strings are (in terms of information content), whereas the NRC (which, in general, is nonsymmetric) indicates the fraction of one string that cannot be constructed using information from the other. This raises the problem of deciding which measure (or question) is more suitable for the answer we need. To compute both, we use a state-of-the-art DNA sequence compressor, which we benchmark against some top compressors in different compression modes. We then apply the compressor to DNA sequences of different scales and natures, first synthetic sequences and then real DNA sequences. The latter include mitochondrial DNA (mtDNA), messenger RNA (mRNA) and genomic DNA (gDNA) of seven primates. We provide several insights into evolutionary acceleration rates at different scales, namely the observation, and confirmation across whole genomes, of a higher variation rate of the mtDNA relative to the gDNA. We also show the importance of relative compression for localizing regions of similar information using mtDNA.
Keywords: DNA sequences; NCD; NRC; data compression; primate evolution.
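The two measures compared in the abstract can be illustrated with a minimal sketch. It assumes the usual definitions, NCD(x, y) = (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)} and NRC(x||y) = C(x||y) / (|x| log2 |A|), where C is a compressed size in bits, C(x||y) is the size of x when encoded using only a model built from y, and A is the alphabet. The abstract does not state these formulas, and nothing below is the authors' compressor: zlib stands in for C, and C(x||y) is approximated by a hypothetical order-k context model with add-one smoothing. The names `compressed_bits`, `ncd`, `relative_bits` and `nrc`, and the default k = 8, are illustrative assumptions.

```python
import math
import random
import zlib
from collections import defaultdict


def compressed_bits(data: bytes) -> int:
    """Compressed size in bits, using zlib as a stand-in compressor."""
    return 8 * len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance between two strings."""
    bx, by = x.encode(), y.encode()
    cx, cy, cxy = compressed_bits(bx), compressed_bits(by), compressed_bits(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


def relative_bits(x: str, y: str, k: int = 8, alphabet: str = "ACGT") -> float:
    """Bits needed to encode x with an order-k model whose counts come from y only."""
    # Count which symbol follows each length-k context in the reference y.
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(k, len(y)):
        counts[y[i - k:i]][y[i]] += 1

    # The first symbols of x have no full context: charge a uniform code.
    bits = min(k, len(x)) * math.log2(len(alphabet))
    for i in range(k, len(x)):
        ctx = counts.get(x[i - k:i], {})  # .get avoids growing the model
        total = sum(ctx.values())
        # Add-one (Laplace) smoothing so unseen symbols keep a nonzero probability.
        p = (ctx.get(x[i], 0) + 1) / (total + len(alphabet))
        bits -= math.log2(p)
    return bits


def nrc(x: str, y: str, k: int = 8, alphabet: str = "ACGT") -> float:
    """Normalized Relative Compression of x given y (nonsymmetric in general)."""
    return relative_bits(x, y, k, alphabet) / (len(x) * math.log2(len(alphabet)))


# Two unrelated synthetic sequences (fixed seed, for illustration only).
rng = random.Random(0)
a = "".join(rng.choice("ACGT") for _ in range(2000))
b = "".join(rng.choice("ACGT") for _ in range(2000))

print("NCD(a, a) =", ncd(a, a), " NCD(a, b) =", ncd(a, b))
print("NRC(a||a) =", nrc(a, a), " NRC(a||b) =", nrc(a, b))
```

In this sketch a sequence scored against itself should yield lower NCD and NRC values than when scored against an unrelated sequence. A general-purpose compressor such as zlib models DNA poorly, so the absolute values are only indicative.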
Conflict of interest statement
The authors declare no conflict of interest.
