Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2015 Jun 15;31(12):1920-8.
doi: 10.1093/bioinformatics/btv071. Epub 2015 Feb 2.

Reference-based compression of short-read sequences using path encoding

Affiliations

Reference-based compression of short-read sequences using path encoding

Carl Kingsford et al. Bioinformatics. .

Abstract

Motivation: Storing, transmitting and archiving data produced by next-generation sequencing is a significant computational burden. New compression techniques tailored to short-read sequence data are needed.

Results: We present here an approach to compression that reduces the difficulty of managing large-scale sequencing data. Our novel approach sits between pure reference-based compression and reference-free compression and combines much of the benefit of reference-based approaches with the flexibility of de novo encoding. Our method, called path encoding, draws a connection between storing paths in de Bruijn graphs and context-dependent arithmetic coding. Supporting this method is a system to compactly store sets of kmers that is of independent interest. We are able to encode RNA-seq reads using 3-11% of the space of the sequence in raw FASTA files, which is on average more than 34% smaller than competing approaches. We also show that even if the reference is very poorly matched to the reads that are being encoded, good compression can still be achieved.

Availability and implementation: Source code and binaries freely available for download at http://www.cs.cmu.edu/∼ckingsf/software/pathenc/, implemented in Go and supported on Linux and Mac OS X.

PubMed Disclaimer

Figures

Fig. 1.
Fig. 1.
Sizes of the various components of the compressed files. ‘Read tails’ are the portion of the reads encoded using arithmetic encoding. ‘Bit tree’ gives the storage used by the bit tree for encoding the read starts (the first k = 16 letters of each read). ‘Read head counts’ is the space taken to store the number of reads with each start. ‘N locations’ is the space to store the location of input Ns that were changed to As upon encoding. ‘Flipped bits’ gives the space needed to record (in a compressed format) a single bit for each read indicating whether the read was reverse complemented
Fig. 2.
Fig. 2.
Performance when several features of the path encoding scheme are disabled. All values are given as percentage over the encoding size for the encoding that uses all the features. (A) ‘No reference’ starts with an empty transcriptome reference. ‘No reverse complement’ disables the reverse complementation of the reads. ‘No duplicate handling’ disables the recognition and special encoding of exact duplicate reads. (B) ‘No dynamic updates’ gives the compression when the probabilities of the statistical model are not updated as reads are encoded
Fig. 3.
Fig. 3.
Effective of kmer length. File size, represented as a fraction of the 2-bit encoding size, using various kmer lengths k.

Similar articles

Cited by

References

    1. Adjeroh D., et al. (2002) DNA sequence compression using the Burrows-Wheeler transform. In: Procceeding IEEE Computer Society Bioinformatics Conference. Vol. 1, IEEE Computer Society, Washington, DC, pp. 303–313. - PubMed
    1. Bhola V., et al. (2011) No-reference compression of genomic data stored in FASTQ format. In: IEEE International Conference on Bioinformatics and Biomedicine. IEEE Computer Society, Washington, DC, pp. 147–150
    1. Bonfield J.K. (2014) The Scramble conversion tool. Bioinformatics, 30, 2818–2819. - PMC - PubMed
    1. Bonfield J.K., Mahoney M.V. (2013) Compression of FASTQ and SAM format sequencing data. PLoS One, 8(3), e59190. - PMC - PubMed
    1. Brandon M.C., et al. (2009) Data structures and compression algorithms for genomic sequence data. Bioinformatics, 25, 1731–1738. - PMC - PubMed

Publication types