Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2009 Jul 15;25(14):1731-8.
doi: 10.1093/bioinformatics/btp319. Epub 2009 May 15.

Data structures and compression algorithms for genomic sequence data

Affiliations

Data structures and compression algorithms for genomic sequence data

Marty C Brandon et al. Bioinformatics. .

Abstract

Motivation: The continuing exponential accumulation of full genome data, including full diploid human genomes, creates new challenges not only for understanding genomic structure, function and evolution, but also for the storage, navigation and privacy of genomic data. Here, we develop data structures and algorithms for the efficient storage of genomic and other sequence data that may also facilitate querying and protecting the data.

Results: The general idea is to encode only the differences between a genome sequence and a reference sequence, using absolute or relative coordinates for the location of the differences. These locations and the corresponding differential variants can be encoded into binary strings using various entropy coding methods, from fixed codes such as Golomb and Elias codes, to variables codes, such as Huffman codes. We demonstrate the approach and various tradeoffs using highly variables human mitochondrial genome sequences as a testbed. With only a partial level of optimization, 3615 genome sequences occupying 56 MB in GenBank are compressed down to only 167 KB, achieving a 345-fold compression rate, using the revised Cambridge Reference Sequence as the reference sequence. Using the consensus sequence as the reference sequence, the data can be stored using only 133 KB, corresponding to a 433-fold level of compression, roughly a 23% improvement. Extensions to nuclear genomes and high-throughput sequencing data are discussed.

Availability: Data are publicly available from GenBank, the HapMap web site, and the MITOMAP database. Supplementary materials with additional results, statistics, and software implementations are available from http://mammag.web.uci.edu/bin/view/Mitowiki/ProjectDNACompression.

PubMed Disclaimer

Figures

Fig. 1.
Fig. 1.
Distribution of intervals between variations using a log rank-log frequency plot. x-axis represents the logarithm of the rank associated with decreasing interval frequencies. y-axis represents the logarithm of the corresponding counts.
Fig. 2.
Fig. 2.
Simplified haplotype classification used in (Brandon et al., 2009).

Similar articles

Cited by

References

    1. Anderson S, et al. Sequence and organization of the human mitochondrial genome. Nature. 1981;290:457–465. - PubMed
    1. Andrews R, et al. Reanalysis and revision of the cambridge reference sequence for human mitochondrial DNA. Nat. Genet. 1999;2:147. - PubMed
    1. Baldi P, et al. Lossless compression of chemical fingerprints using integer entropy codes improves storage and retrieval. J. Chem. Inf. Model. 2007;47:2098–2109. - PMC - PubMed
    1. Behzadi B, Fessant F. DNA compression challenge revisited: a dynamic programming approach. Lect. Notes Comput. Sci. 2005;3537:190–200.
    1. Brandon M, et al. MITOMASTER: a bioinformatics tool for the analysis of mitochondrial DNA sequences. Hum. Mutat. 2009;30(Database Issue):1–6. - PMC - PubMed

Publication types