Efficient storage of high throughput DNA sequencing data using reference-based compression

Markus Hsi-Yang Fritz¹, Rasko Leinonen, Guy Cochrane, Ewan Birney

Affiliation

¹ European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SD, United Kingdom.

PMID: 21245279
PMCID: PMC3083090
DOI: 10.1101/gr.114819.110

Genome Res. 2011 May;21(5):734-40. Epub 2011 Jan 18.

Abstract

Data storage costs have become an appreciable proportion of total cost in the creation and analysis of DNA sequence data. Of particular concern is that the rate of increase in DNA sequencing is significantly outstripping the rate of increase in disk storage capacity. In this paper we present a new reference-based compression method that efficiently compresses DNA sequences for storage. Our approach works for resequencing experiments that target well-studied genomes. We align new sequences to a reference genome and then encode the differences between the new sequence and the reference genome for storage. Our compression method is most efficient when we allow controlled loss of data in the saving of quality information and unaligned sequences. With this new compression method we observe exponential efficiency gains as read lengths increase, and the magnitude of this efficiency gain can be controlled by changing the amount of quality information stored. Our compression method is tunable: the storage of quality scores and unaligned sequences may be adjusted for different experiments to conserve information or to minimize storage costs. It provides one opportunity to address the threat that increasing DNA sequence volumes will overcome our ability to store the sequences.

Figures

**Figure 1.**
Schematic of the compression technique. (A) Reads are first aligned to an established reference. (B) Unaligned reads are then pooled to create a specific “compression framework” for this data set. (C) The base pair information is then stored using specific offsets of reads on the reference, with substitutions, insertions, or deletions encoded in separate data structures.
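
The storage step in panel C can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `EncodedRead` record, the `encode` and `decode` functions, and the CIGAR-like list of `(op, length)` pairs used to describe the alignment are all assumptions made for illustration. The sketch only shows the general idea of keeping a read's offset on the reference plus separate lists for substitutions, insertions, and deletions, and it leaves out the bit-level coding entirely.

```python
from dataclasses import dataclass, field


@dataclass
class EncodedRead:
    """Hypothetical record: read offset plus differences from the reference."""
    pos: int                                   # 0-based offset on the reference
    length: int                                # read length in bases
    subs: list = field(default_factory=list)   # (offset in read, substituted base)
    ins: list = field(default_factory=list)    # (offset in read, inserted bases)
    dels: list = field(default_factory=list)   # (offset in read, deleted length)


def encode(reference, read, pos, cigar):
    """Encode an aligned read as differences from the reference.

    `cigar` is an assumed alignment description: a list of (op, length)
    pairs with op in {"M", "I", "D"}, with no two adjacent ops of the same kind.
    """
    enc = EncodedRead(pos=pos, length=len(read))
    r, q = pos, 0  # cursors on the reference and on the read
    for op, n in cigar:
        if op == "M":  # aligned bases: record only the mismatches
            for k in range(n):
                if read[q + k] != reference[r + k]:
                    enc.subs.append((q + k, read[q + k]))
            r += n
            q += n
        elif op == "I":  # bases present in the read but not in the reference
            enc.ins.append((q, read[q:q + n]))
            q += n
        elif op == "D":  # reference bases missing from the read
            enc.dels.append((q, n))
            r += n
        else:
            raise ValueError(f"unsupported alignment op: {op!r}")
    return enc


def decode(reference, enc):
    """Rebuild the read from the reference and the stored differences."""
    subs, ins, dels = dict(enc.subs), dict(enc.ins), dict(enc.dels)
    out = []
    r, q = enc.pos, 0
    while q < enc.length:
        if q in dels:            # skip deleted reference bases once
            r += dels.pop(q)
        if q in ins:             # copy inserted bases verbatim
            out.append(ins[q])
            q += len(ins[q])
            continue
        out.append(subs.get(q, reference[r]))
        r += 1
        q += 1
    return "".join(out)


reference = "ACGTACGTACGT"
read = "GAATTCGACG"
cigar = [("M", 3), ("I", 2), ("M", 2), ("D", 1), ("M", 3)]
enc = encode(reference, read, pos=2, cigar=cigar)
restored = decode(reference, enc)
```

In this toy example the ten-base read is reduced to one offset, one substitution, one insertion, and one deletion, and `decode` recovers the original read from the reference alone. A read that matches the reference exactly would need only its position and length.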

**Figure 2.**
Compression efficiency for simulated data sets. The plot shows storage of DNA sequence expressed as bits/base stored on the y-axis (log scale) vs. coverage of data sets (x-axis) for different read lengths (the different colors) after reference-based compression. The different columns indicate different simulated error rates (0.01%, 0.1%, 1.0%). The *left* three panels show this for unpaired data, the *right* three for paired data.

**Figure 3.**
Storage components for three parameterizations of simulated data: 0.1% error and 1× coverage (*left* panel), 1% error and 1× coverage (*middle*), and 1% error and 25× coverage (*right*). *readpos* and *readflags* are the storage of the read positions and read flags (strand, exact match), respectively. Variation storage for substitutions (*subst*), insertions (*insert*), and deletions (*del*) is split into positional information (*pos*), flags (*flags*), and bases (*bases*, for substitutions and insertions) or length (*len*, for deletions). The pie charts show overall storage requirements, where *readinfo* sums over read positions and read flags, and *variation* is the sum over all variation storage components.

**Figure 4.**
Storage costs for different quality budgets. The plot shows the change in storage cost (expressed as bits/base, including quality information, y-axis) with respect to read length for different quality budgets for a fixed coverage (10×) simulated data set. Note that not only do lower quality budgets compress better, but also the compression efficiency improves proportionally more at lower quality budgets for higher read lengths. Quality budgets are the percentage of base pairs in the data set for which quality scores are retained.
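
The notion of a quality budget can be made concrete with a short Python sketch. The caption defines the budget only as the percentage of base pairs whose quality scores are retained; it does not say which bases are chosen. The `apply_quality_budget` function and its policy of keeping scores at mismatching positions first, then filling the rest in read order, are assumptions for illustration and not the selection rule used in the paper.

```python
def apply_quality_budget(quals, mismatches, budget_percent):
    """Retain quality scores for at most `budget_percent` of all bases.

    quals          -- list of per-read lists of quality scores
    mismatches     -- set of (read index, offset) pairs that differ from the
                      reference; these are kept first (assumed policy)
    budget_percent -- integer percentage of bases whose scores are retained
    """
    total = sum(len(q) for q in quals)
    keep = total * budget_percent // 100  # integer arithmetic avoids rounding drift

    # Mismatch positions come first, then every other position in read order.
    ordered = sorted(mismatches)
    seen = set(ordered)
    ordered += [
        (i, j)
        for i, q in enumerate(quals)
        for j in range(len(q))
        if (i, j) not in seen
    ]
    return {(i, j): quals[i][j] for (i, j) in ordered[:keep]}


quals = [[30, 31, 32, 33], [20, 21, 22, 23]]
kept = apply_quality_budget(quals, mismatches={(1, 2)}, budget_percent=25)
```

With eight bases and a 25% budget, two scores survive: the one at the mismatching position and the first remaining base. All other scores are discarded, which is the controlled loss of quality information that the budget trades against storage cost.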

Grants and funding

085532/WT_/Wellcome Trust/United Kingdom
