Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2013;8(3):e59190.
doi: 10.1371/journal.pone.0059190. Epub 2013 Mar 22.

Compression of FASTQ and SAM format sequencing data

Affiliations

Compression of FASTQ and SAM format sequencing data

James K Bonfield et al. PLoS One. 2013.

Abstract

Storage and transmission of the data produced by modern DNA sequencing instruments has become a major concern, which prompted the Pistoia Alliance to pose the SequenceSqueeze contest for compression of FASTQ files. We present several compression entries from the competition, Fastqz and Samcomp/Fqzcomp, including the winning entry. These are compared against existing algorithms for both reference based compression (CRAM, Goby) and non-reference based compression (DSRC, BAM) and other recently published competition entries (Quip, SCALCE). The tools are shown to be the new Pareto frontier for FASTQ compression, offering state of the art ratios at affordable CPU costs. All programs are freely available on SourceForge. Fastqz: https://sourceforge.net/projects/fastqz/, fqzcomp: https://sourceforge.net/projects/fqzcomp/, and samcomp: https://sourceforge.net/projects/samcomp/.

PubMed Disclaimer

Conflict of interest statement

Competing Interests: The statement of the work being supported by Wellcome Trust and Dell Inc. is there solely as they covered the salaries of the authors during the time of the program development. Neither the Wellcome Trust nor Dell Inc. hold any patents on this work, nor do they intend to. Please also note that the programs are available under a BSD-2 open source licence, meaning they are free with no restrictions. MM is employed by Dell Inc. There are no patents, products in development or marketed products to declare. This does not alter the authors’ adherence to all the PLOS ONE policies on sharing data and materials, as detailed online in the guide for authors.

Figures

Figure 1
Figure 1. SequenceSqueeze results: real time vs compression ratio.
Each mark represents a different entry, coloured by author. Ibrahim Numanagic’s entry had a minor decoding problem causing a minority of read-names to mismatch. All other entries plotted here were lossless. The entries have been broken down into reference based (A) and non-reference based (B) solutions. A clear wall can be seen in the non-reference methods requiring exponential growth in CPU time for minor linear improvements in compression ratio.
Figure 2
Figure 2. File size ratios vs real time to compress.
SRR007215 is SOLiD data, SRR003177 is 454 data, while SRR02750 and SRR065390 are Illumina data at shallow and deep depths respectively. Not all programs support all types of data.

References

    1. Cochrane G, Cook CE, Birney E (2012) The future of dna sequence archiving. GigaScience 1: 2. - PMC - PubMed
    1. Cock PJA, Fields CJ, Goto N, Heuer ML, Rice PM (2010) The sanger FASTQ format for sequences with quality scores, and the Solexa/fillumina FASTQ variants. Nucleic Acids Research 38: 1767–1771. - PMC - PubMed
    1. Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, et al. (2005) Genome sequencing in microfabricated high-density picolitre reactors. Nature 437: 376–380. - PMC - PubMed
    1. Pandey V, Nutter RC, Prediger E (2008) Next-Generation Genome Sequencing, Berlin, Germany: Wiley- VCH, chapter Applied Biosystems SOLiD System: Ligation-Based Sequencing. 29–41.
    1. Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, et al. (2008) Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456: 53–59. - PMC - PubMed

Publication types

MeSH terms