Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph

Gaëtan Benoit et al. BMC Bioinformatics. 2015 Sep 14;16:288. doi: 10.1186/s12859-015-0709-7.

Abstract

Background: The volume of data generated by next-generation sequencing (NGS) technologies is now a major concern for both data storage and transmission. This has triggered the need for methods more efficient than general-purpose compression tools, such as the widely used gzip.

Results: We present a novel reference-free method for compressing data produced by high-throughput sequencing technologies. Our approach, implemented in the software LEON, employs techniques derived from existing assembly principles. The method is based on a reference probabilistic de Bruijn graph, built de novo from the set of reads and stored in a Bloom filter. Each read is encoded as a path in this graph, by memorizing an anchoring k-mer and a list of bifurcations. The same probabilistic de Bruijn graph is also used to perform a lossy transformation of the quality scores, which yields higher compression rates without losing information pertinent to downstream analyses.
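The core data structure described here — sufficiently abundant ("solid") k-mers inserted into a Bloom filter, which then answers probabilistic membership queries during graph traversal — can be sketched as follows. This is an illustrative toy, not LEON's implementation: the names (`BloomFilter`, `build_graph`, `successors`), the filter size, the hash count and the `min_abundance` threshold are all hypothetical choices.

```python
import hashlib
from collections import Counter

class BloomFilter:
    """Toy Bloom filter: n hash functions over a fixed bit array."""
    def __init__(self, size_bits=1 << 16, n_hashes=4):
        self.size = size_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # may return a false positive, never a false negative
        return all(self.bits[p // 8] >> (p % 8) & 1
                   for p in self._positions(item))

def canonical(kmer):
    """Lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(str.maketrans("ACGT", "TGCA"))[::-1]
    return min(kmer, rc)

def build_graph(reads, k, min_abundance=2):
    """Count k-mers and keep only the sufficiently abundant ('solid')
    ones, which filters out most sequencing errors."""
    counts = Counter(canonical(r[i:i + k]) for r in reads
                     for i in range(len(r) - k + 1))
    bf = BloomFilter()
    for kmer, count in counts.items():
        if count >= min_abundance:
            bf.add(kmer)
    return bf

def successors(bf, kmer):
    """Graph edges are implicit: query the four one-base extensions."""
    return [kmer[1:] + b for b in "ACGT" if canonical(kmer[1:] + b) in bf]
```

Because only membership bits are stored, such a graph is far smaller than an explicit node list; the price is a small false-positive rate, i.e. occasional spurious branches that the encoder must tolerate.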

Conclusions: LEON was run on various real sequencing datasets (whole genome, exome, RNA-seq and metagenomics). In all cases, LEON achieved higher overall compression ratios than state-of-the-art compression software. On a C. elegans whole-genome sequencing dataset, LEON divided the original file size by more than 20. LEON is open-source software, distributed under the GNU Affero GPL license and available for download at http://gatb.inria.fr/software/leon/.


Figures

Fig. 1
LEON method overview. First, a de Bruijn graph is constructed from the reads: k-mers are counted, then sufficiently abundant k-mers are inserted into a Bloom filter representing a probabilistic de Bruijn graph. Reads are then mapped to this graph, and the information required to rebuild each read from the graph is stored in the compressed file: an anchoring k-mer and a list of bifurcations.
Fig. 2
Schematic description of LEON's path encoding. The upper part shows the mapping of two reads to the de Bruijn graph. Anchor k-mers are shown in blue; bifurcations followed in the graph (read on the left side) and differences from the graph (read on the right side) are highlighted in green and red, respectively. The bottom part shows the corresponding path encodings for these two reads: the index of the anchor k-mer and, for each side, the path length and bifurcation list.
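The anchor-plus-bifurcations scheme in this caption can be illustrated with a minimal encoder/decoder pair. This sketch assumes every k-mer of the read is present in the graph (the red "differences from the graph" case, which LEON records separately, is omitted), substitutes an exact k-mer set for the Bloom filter, and uses hypothetical names throughout.

```python
def encode_read(read, graph, k):
    """Encode a read as (anchor k-mer, read length, bifurcation list).
    Assumes the read's entire path already exists in the graph."""
    anchor = read[:k]
    assert anchor in graph  # a real tool falls back to raw storage here
    bifurcations, kmer = [], anchor
    for base in read[k:]:
        extensions = [b for b in "ACGT" if kmer[1:] + b in graph]
        if len(extensions) > 1:      # branching node: remember the choice
            bifurcations.append(base)
        kmer = kmer[1:] + base
    return anchor, len(read), bifurcations

def decode_read(anchor, length, bifurcations, graph, k):
    """Rebuild the read by walking the graph from the anchor, consuming
    one recorded base at each branching node."""
    read, kmer, choices = anchor, anchor, iter(bifurcations)
    while len(read) < length:
        extensions = [b for b in "ACGT" if kmer[1:] + b in graph]
        base = extensions[0] if len(extensions) == 1 else next(choices)
        read += base
        kmer = kmer[1:] + base
    return read
```

On non-branching stretches of the graph nothing needs to be recorded at all, which is why highly redundant (high-coverage) sequence data compresses so well under this scheme.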
Fig. 3
Contribution of each component to sequence compression. Sequence compression ratio (top) and relative contribution of each component to the compressed sequence stream (bottom) for diverse datasets. "WGS high" denotes high coverage (116×, 70× and 102×, respectively); "WGS low" denotes down-sampling to 10×.
Fig. 4
Sequence compression ratios by coverage. Compression ratios obtained by LEON on the sequence stream, with respect to the sequencing coverage of the datasets. The three WGS datasets were down-sampled to obtain lower coverages.
Fig. 5
Comparison of compression ratios. Comparison of compression ratios between de novo compression software for diverse datasets. Top: overall compression factor (original file size / compressed file size). Bottom: space distribution between headers, sequences and quality scores (in red, green and blue, respectively).
Fig. 6
Compression/accuracy trade-off for quality compression. Impact of lossy quality-score compression methods on SNP calling for human chromosome 20 (individual HG00096, run SRR062634), compared to a gold standard. Each line represents the F-score/compressed-size trade-off for one method; the higher the line, the better. The dashed line represents the F-score obtained with the original FASTQ file and with lossless compression methods.
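The intuition behind the lossy mode — bases supported by solid k-mers of the graph are almost certainly correct, so their exact quality value carries little information — can be sketched as follows. The coverage rule and the replacement value here are illustrative assumptions, not LEON's exact transformation; `graph` is any object supporting k-mer membership tests (e.g. a set or a Bloom filter).

```python
def smooth_qualities(read, quals, graph, k, replacement="I"):
    """Replace the quality score of every base covered by at least one
    solid k-mer with a single high value ('I' = Q40 in Sanger Phred+33
    encoding). The resulting long constant runs compress extremely well."""
    covered = [False] * len(read)
    for i in range(len(read) - k + 1):
        if read[i:i + k] in graph:       # this window matches the graph
            for j in range(i, i + k):
                covered[j] = True
    return "".join(replacement if c else q for c, q in zip(covered, quals))
```

Qualities at positions covered by no solid k-mer (the likely sequencing errors that variant callers actually weigh) are left untouched, which is how pertinent information for downstream analyses such as SNP calling can be preserved.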
