Data structures and compression algorithms for genomic sequence data
- PMID: 19447783
- PMCID: PMC2705231
- DOI: 10.1093/bioinformatics/btp319
Data structures and compression algorithms for genomic sequence data
Abstract
Motivation: The continuing exponential accumulation of full genome data, including full diploid human genomes, creates new challenges not only for understanding genomic structure, function and evolution, but also for the storage, navigation and privacy of genomic data. Here, we develop data structures and algorithms for the efficient storage of genomic and other sequence data that may also facilitate querying and protecting the data.
Results: The general idea is to encode only the differences between a genome sequence and a reference sequence, using absolute or relative coordinates for the location of the differences. These locations and the corresponding differential variants can be encoded into binary strings using various entropy coding methods, from fixed codes such as Golomb and Elias codes, to variables codes, such as Huffman codes. We demonstrate the approach and various tradeoffs using highly variables human mitochondrial genome sequences as a testbed. With only a partial level of optimization, 3615 genome sequences occupying 56 MB in GenBank are compressed down to only 167 KB, achieving a 345-fold compression rate, using the revised Cambridge Reference Sequence as the reference sequence. Using the consensus sequence as the reference sequence, the data can be stored using only 133 KB, corresponding to a 433-fold level of compression, roughly a 23% improvement. Extensions to nuclear genomes and high-throughput sequencing data are discussed.
Availability: Data are publicly available from GenBank, the HapMap web site, and the MITOMAP database. Supplementary materials with additional results, statistics, and software implementations are available from http://mammag.web.uci.edu/bin/view/Mitowiki/ProjectDNACompression.
Figures


Similar articles
-
CoGI: Towards Compressing Genomes as an Image.IEEE/ACM Trans Comput Biol Bioinform. 2015 Nov-Dec;12(6):1275-85. doi: 10.1109/TCBB.2015.2430331. IEEE/ACM Trans Comput Biol Bioinform. 2015. PMID: 26671800
-
Toward a Better Compression for DNA Sequences Using Huffman Encoding.J Comput Biol. 2017 Apr;24(4):280-288. doi: 10.1089/cmb.2016.0151. Epub 2016 Dec 13. J Comput Biol. 2017. PMID: 27960065 Free PMC article.
-
Efficient storage of high throughput DNA sequencing data using reference-based compression.Genome Res. 2011 May;21(5):734-40. doi: 10.1101/gr.114819.110. Epub 2011 Jan 18. Genome Res. 2011. PMID: 21245279 Free PMC article.
-
Vertical lossless genomic data compression tools for assembled genomes: A systematic literature review.PLoS One. 2020 May 26;15(5):e0232942. doi: 10.1371/journal.pone.0232942. eCollection 2020. PLoS One. 2020. PMID: 32453750 Free PMC article.
-
When the levee breaks: a practical guide to sketching algorithms for processing the flood of genomic data.Genome Biol. 2019 Sep 13;20(1):199. doi: 10.1186/s13059-019-1809-x. Genome Biol. 2019. PMID: 31519212 Free PMC article. Review.
Cited by
-
Bitpacking techniques for indexing genomes: I. Hash tables.Algorithms Mol Biol. 2016 Apr 18;11:5. doi: 10.1186/s13015-016-0069-5. eCollection 2016. Algorithms Mol Biol. 2016. PMID: 27095998 Free PMC article.
-
NGC: lossless and lossy compression of aligned high-throughput sequencing data.Nucleic Acids Res. 2013 Jan 7;41(1):e27. doi: 10.1093/nar/gks939. Epub 2012 Oct 12. Nucleic Acids Res. 2013. PMID: 23066097 Free PMC article.
-
TRCMGene: A two-step referential compression method for the efficient storage of genetic data.PLoS One. 2018 Nov 5;13(11):e0206521. doi: 10.1371/journal.pone.0206521. eCollection 2018. PLoS One. 2018. PMID: 30395579 Free PMC article.
-
DNA barcode goes two-dimensions: DNA QR code web server.PLoS One. 2012;7(5):e35146. doi: 10.1371/journal.pone.0035146. Epub 2012 May 4. PLoS One. 2012. PMID: 22574113 Free PMC article.
-
Compression and fast retrieval of SNP data.Bioinformatics. 2014 Nov 1;30(21):3078-85. doi: 10.1093/bioinformatics/btu495. Epub 2014 Jul 26. Bioinformatics. 2014. PMID: 25064564 Free PMC article.
References
-
- Anderson S, et al. Sequence and organization of the human mitochondrial genome. Nature. 1981;290:457–465. - PubMed
-
- Andrews R, et al. Reanalysis and revision of the cambridge reference sequence for human mitochondrial DNA. Nat. Genet. 1999;2:147. - PubMed
-
- Behzadi B, Fessant F. DNA compression challenge revisited: a dynamic programming approach. Lect. Notes Comput. Sci. 2005;3537:190–200.
Publication types
MeSH terms
Grants and funding
LinkOut - more resources
Full Text Sources
Other Literature Sources
Miscellaneous