Data Compression Concepts and Algorithms and their Applications to Bioinformatics
- PMID: 20157640
- PMCID: PMC2821113
- DOI: 10.3390/e12010034
Abstract
Data compression is, at its core, concerned with how information is organized in data. Understanding this organization can lead to efficient ways of representing the information, and hence to data compression. In this paper we review the ways in which ideas and approaches fundamental to the theory and practice of data compression have been used in bioinformatics. We look at how basic theoretical ideas from data compression, such as the notions of entropy, mutual information, and complexity, have been used to analyze biological sequences in order to discover hidden patterns, infer phylogenetic relationships between organisms, and study viral populations. Finally, we look at how inferred grammars for biological sequences have been used to uncover structure in those sequences.
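As a rough illustration of two of the ideas named in the abstract, the sketch below computes the zeroth-order Shannon entropy of a sequence and a compression-based dissimilarity between two sequences. It is not taken from the paper: the function names (`shannon_entropy`, `compressed_size`, `ncd`) are hypothetical, the dissimilarity is the normalized compression distance as it is commonly defined, and the general-purpose `zlib` compressor is an assumption standing in for any sequence-specific compressor.

```python
import math
import zlib
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Zeroth-order Shannon entropy of seq, in bits per symbol."""
    if not seq:
        return 0.0
    n = len(seq)
    counts = Counter(seq)
    # H = -sum_i p_i * log2(p_i), with p_i the empirical symbol frequency.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def compressed_size(seq: str) -> int:
    """Length in bytes of seq after zlib compression (an assumed stand-in compressor)."""
    return len(zlib.compress(seq.encode("ascii"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance: smaller values suggest more shared structure."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

A uniform sequence over the four nucleotides, such as `"ACGT"`, gives `shannon_entropy` a value of 2 bits per symbol, the maximum for a four-letter alphabet. Pairwise `ncd` values over a set of sequences can be collected into a distance matrix, which is one way compression has been used as a proxy for relatedness; the choice of compressor strongly affects the result, so `zlib` here is for illustration only.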