Significantly lower entropy estimates for natural DNA sequences

D Loewenstern¹, P N Yianilos

PMID: 10223669
DOI: 10.1089/cmb.1999.6.125

D Loewenstern et al. J Comput Biol. 1999 Spring;6(1):125-42. doi: 10.1089/cmb.1999.6.125.

Affiliation

¹ NEC Research Institute, Princeton, New Jersey 08540, USA. davel@research.nj.nec.com

Abstract

If DNA were a random string over its alphabet {A, C, G, T}, an optimal code would assign two bits to each nucleotide. DNA may be imagined to be a highly ordered, purposeful molecule, and one might therefore reasonably expect statistical models of its string representation to produce much lower entropy estimates. Surprisingly, this has not been the case for many natural DNA sequences, including portions of the human genome. We introduce a new statistical model (compression algorithm), the strongest reported to date, for naturally occurring DNA sequences. Conventional techniques code a nucleotide using only slightly fewer bits (1.90) than one obtains by relying only on the frequency statistics of individual nucleotides (1.95). Our method in some cases increases this gap by more than fivefold (1.66) and may lead to better performance in microbiological pattern recognition applications. One of our main contributions, and the principal source of these improvements, is the formal inclusion of inexact match information in the model. The existence of matches at various distances forms a panel of experts which are then combined into a single prediction. The structure of this combination is novel and its parameters are learned using Expectation Maximization (EM). Experiments are reported using a wide variety of DNA sequences and compared whenever possible with earlier work. Four reasonable notions for the string distance function used to identify near matches are implemented and experimentally compared. We also report lower entropy estimates for coding regions extracted from a large collection of nonredundant human genes. The conventional estimate is 1.92 bits. Our model produces only slightly better results (1.91 bits) when considering nucleotides, but achieves 1.84-1.87 bits when the prediction problem is divided into two stages: (i) predict the next amino acid based on inexact polypeptide matches, and (ii) predict the particular codon. Our results suggest that matches at the amino acid level play some role, but a small one, in determining the statistical structure of nonredundant coding sequences.
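
As a rough illustration of two ideas in the abstract, the Python sketch below computes an entropy estimate from single-nucleotide frequencies and learns mixing weights for a panel of experts with EM. It is not the authors' model: the function names (`order0_entropy`, `em_mixture_weights`, `code_length`) and the input layout are assumptions made for this example, the experts are treated as fixed black-box predictors, and every expert probability is assumed to be strictly positive.

```python
import math
from collections import Counter


def order0_entropy(seq):
    """Bits per nucleotide using only single-symbol frequencies."""
    counts = Counter(seq)
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def em_mixture_weights(expert_probs, iterations=50):
    """Learn mixing weights for a fixed panel of experts with EM.

    expert_probs[t][k] is the probability that (hypothetical) expert k
    assigned to the symbol that actually occurred at position t.
    All probabilities are assumed to be strictly positive.
    """
    k = len(expert_probs[0])
    weights = [1.0 / k] * k  # start from a uniform mixture
    for _ in range(iterations):
        totals = [0.0] * k
        for probs in expert_probs:
            mix = sum(w * p for w, p in zip(weights, probs))
            for j in range(k):
                # E-step: responsibility of expert j for this position
                totals[j] += weights[j] * probs[j] / mix
        # M-step: new weight is the average responsibility
        weights = [t / len(expert_probs) for t in totals]
    return weights


def code_length(expert_probs, weights):
    """Average bits per symbol when coding with the weighted mixture."""
    total = 0.0
    for probs in expert_probs:
        total -= math.log2(sum(w * p for w, p in zip(weights, probs)))
    return total / len(expert_probs)
```

For a sequence in which the four nucleotides are equally frequent, such as `"ACGT"`, `order0_entropy` returns the two-bit baseline mentioned in the abstract. The reported figures can be read the same way: the gap over the frequency-only estimate grows from 1.95 - 1.90 = 0.05 bits to 1.95 - 1.66 = 0.29 bits, which is the "more than fivefold" improvement.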

Cited by

Toward a Better Compression for DNA Sequences Using Huffman Encoding.
Al-Okaily A, Almarri B, Al Yami S, Huang CH. J Comput Biol. 2017 Apr;24(4):280-288. doi: 10.1089/cmb.2016.0151. Epub 2016 Dec 13. PMID: 27960065. Free PMC article.
An Optimal Seed Based Compression Algorithm for DNA Sequences.
Eric PV, Gopalakrishnan G, Karunakaran M. Adv Bioinformatics. 2016;2016:3528406. doi: 10.1155/2016/3528406. Epub 2016 Jul 31. PMID: 27555868. Free PMC article.
GReEn: a tool for efficient compression of genome resequencing data.
Pinho AJ, Pratas D, Garcia SP. Nucleic Acids Res. 2012 Feb;40(4):e27. doi: 10.1093/nar/gkr1124. Epub 2011 Dec 1. PMID: 22139935. Free PMC article.
Comparative analysis of long DNA sequences by per element information content using different contexts.
Dix TI, Powell DR, Allison L, Bernal J, Jaeger S, Stern L. BMC Bioinformatics. 2007 May 3;8 Suppl 2(Suppl 2):S10. doi: 10.1186/1471-2105-8-S2-S10. PMID: 17493248. Free PMC article.
Entropy Estimation Using a Linguistic Zipf-Mandelbrot-Li Model for Natural Sequences.
Back AD, Wiles J. Entropy (Basel). 2021 Aug 24;23(9):1100. doi: 10.3390/e23091100. PMID: 34573725. Free PMC article.
