Reference-based compression of short-read sequences using path encoding
- PMID: 25649622
- PMCID: PMC4481695
- DOI: 10.1093/bioinformatics/btv071
Abstract
Motivation: Storing, transmitting and archiving data produced by next-generation sequencing is a significant computational burden. New compression techniques tailored to short-read sequence data are needed.
Results: We present here an approach to compression that reduces the difficulty of managing large-scale sequencing data. Our novel approach sits between pure reference-based compression and reference-free compression and combines much of the benefit of reference-based approaches with the flexibility of de novo encoding. Our method, called path encoding, draws a connection between storing paths in de Bruijn graphs and context-dependent arithmetic coding. Supporting this method is a system to compactly store sets of kmers that is of independent interest. We are able to encode RNA-seq reads using 3-11% of the space of the sequence in raw FASTA files, which is on average more than 34% smaller than competing approaches. We also show that even if the reference is very poorly matched to the reads that are being encoded, good compression can still be achieved.
Availability and implementation: Source code and binaries freely available for download at http://www.cs.cmu.edu/~ckingsf/software/pathenc/, implemented in Go and supported on Linux and Mac OS X.
© The Author 2015. Published by Oxford University Press.