Bioinformatics. 2019 Jan 15;35(2):337-339.
doi: 10.1093/bioinformatics/bty608.

Crumble: reference free lossy compression of sequence quality values

James K Bonfield et al. Bioinformatics.

Abstract

Motivation: Per-base quality values take up most of the space in CRAM files of NGS data. Most of these values are unnecessary for variant calling, which offers an opportunity to save space.
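
To illustrate why coarser quality values are cheaper to store, here is a minimal sketch. It is not Crumble's method: it applies a simple fixed binning (the function `bin_quality` and its four bins are hypothetical) to synthetic, uniformly random quality values and compares sizes after general-purpose compression with zlib. Real quality strings are far less uniform, so the sizes it prints say nothing about real data.

```python
import random
import zlib


def bin_quality(q: int) -> int:
    """Map a Phred score onto one of four representative values (hypothetical bins)."""
    if q < 10:
        return 5
    if q < 20:
        return 15
    if q < 30:
        return 25
    return 37


rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic quality values in the range 0..40 (an assumption, not real read data).
original = bytes(rng.randint(0, 40) for _ in range(100_000))
binned = bytes(bin_quality(q) for q in original)

# Fewer distinct symbols means lower entropy, so the compressor needs fewer bytes.
original_size = len(zlib.compress(original, 9))
binned_size = len(zlib.compress(binned, 9))
print(f"original: {original_size} bytes, binned: {binned_size} bytes")
```

The sketch discards information uniformly across all positions. The abstract's claim is narrower: most, but not all, quality values are unnecessary for variant calling.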

Results: On the Syndip test set, a 17-fold reduction in the quality storage portion of a CRAM file can be achieved while maintaining variant calling accuracy. The size reduction of an entire CRAM file varied from 2.2- to 7.4-fold, depending on the non-quality content of the original file (see Supplementary Material S6 for details).
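
The dependence of the whole-file figure on non-quality content can be sketched with simple arithmetic. The sketch assumes that only the quality portion shrinks and the rest of the file is unchanged, and that the 17-fold figure applies throughout; the abstract states neither. The function name `overall_fold` and the quality shares used are illustrative, not values from the paper.

```python
def overall_fold(quality_fraction: float, quality_fold: float) -> float:
    """Whole-file reduction factor when only the quality portion shrinks.

    quality_fraction: share of the original file taken by quality values (0..1).
    quality_fold: reduction factor applied to that share (must be positive).
    """
    if not 0.0 <= quality_fraction <= 1.0:
        raise ValueError("quality_fraction must lie in [0, 1]")
    if quality_fold <= 0:
        raise ValueError("quality_fold must be positive")
    # New size relative to the original: untouched part plus the shrunken quality part.
    remaining = (1.0 - quality_fraction) + quality_fraction / quality_fold
    return 1.0 / remaining


# Illustrative quality shares (assumptions, not figures from the paper).
for fraction in (0.5, 0.7, 0.9):
    fold = overall_fold(fraction, 17.0)
    print(f"quality share {fraction:.0%}: {fold:.2f}-fold overall")
```

Under these assumptions, a file in which quality values are a larger share of the total shrinks more overall, which is consistent with the reported spread.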

Availability and implementation: Crumble is open source and can be obtained from https://github.com/jkbonfield/crumble.

Supplementary information: Supplementary data are available at Bioinformatics online.
