Entropy (Basel). 2010 Jan 1;12(1):34.
doi: 10.3390/e12010034.

Data Compression Concepts and Algorithms and their Applications to Bioinformatics

O U Nalbantog̃lu et al. Entropy (Basel).

Abstract

Data compression at its base is concerned with how information is organized in data. Understanding this organization can lead to efficient ways of representing the information and hence data compression. In this paper we review the ways in which ideas and approaches fundamental to the theory and practice of data compression have been used in the area of bioinformatics. We look at how basic theoretical ideas from data compression, such as the notions of entropy, mutual information, and complexity have been used for analyzing biological sequences in order to discover hidden patterns, infer phylogenetic relationships between organisms and study viral populations. Finally, we look at how inferred grammars for biological sequences have been used to uncover structure in biological sequences.
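As a rough illustration of two of the quantities named in the abstract, the sketch below estimates the Shannon entropy of a sequence and the mutual information between symbols a fixed distance apart. It is a minimal example under stated assumptions, not the authors' method: the function names `entropy` and `mutual_information` and the `lag` parameter are hypothetical, and probabilities are simple empirical frequencies with no correction for short sequences.

```python
from collections import Counter
from math import log2


def entropy(seq):
    """Shannon entropy (bits per symbol) of the empirical symbol frequencies."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def mutual_information(seq, lag=1):
    """Mutual information (bits) between symbols `lag` positions apart.

    Assumption: plain plug-in estimate from pair counts; `lag` is a
    hypothetical parameter name chosen for this illustration.
    """
    pairs = list(zip(seq, seq[lag:]))
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    # p(a,b) / (p(a) p(b)) simplifies to c * n / (left[a] * right[b])
    return sum(
        (c / n) * log2(c * n / (left[a] * right[b]))
        for (a, b), c in joint.items()
    )
```

For a sequence that uses the four nucleotides equally often, such as `"ACGT"`, `entropy` returns 2.0 bits per symbol, the maximum for a four-letter alphabet. A strictly periodic sequence such as `"ACGTACGTACGT"` gives a lag-1 mutual information equal to the entropy of the leading symbols, since each symbol fully determines the next.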

Figures

Figure 1. Plot of the redundancy rate versus $D_2/(D_1+D_2)$, using the genes available at the time, shows a clear segregation of Phage, Bacteria, and Vertebrate sequences.
Figure 2. Inclusion of additional sequences breaks down the segregation observed by Gatlin.
Figure 3. The logo of a number of sequences at the beginning of a gene. The start codon ATG is immediately apparent. The logo was constructed using the software at http://weblogo.threeplusone.com/.
Figure 4. AMI charts for HIV-1 populations isolated from patients who remained asymptomatic. The large number of white pixels generally indicates a high degree of covariation, while “checkerboard” regions indicate specific segments of the envelope protein with correlated mutations [20].
Figure 5. AMI charts for HIV-1 populations isolated from patients who succumbed to AIDS. The preponderance of black pixels indicates a relatively homogeneous population [20].
Figure 6. A block diagram depicting the basic steps involved in a grammar-based compression scheme.

References

    1. Schrodinger E. What is Life? Cambridge University Press; 1944.
    2. Giancarlo R, Scaturro D, Utro F. Textual Data Compression In Computational Biology: A synopsis. Bioinformatics. 2009;25:1575–1586.
    3. Gatlin L. Triplet frequencies in DNA and the genetic program. Journal of Theoretical Biology. 1963;5:360–371.
    4. Gatlin L. The information content of DNA. Journal of Theoretical Biology. 1966;10:281–300.
    5. Gatlin L. The information content of DNA II. Journal of Theoretical Biology. 1968;18:181–194.