Compression and fast retrieval of SNP data

Francesco Sambo¹, Barbara Di Camillo¹, Gianna Toffolo¹, Claudio Cobelli¹

PMID: 25064564
PMCID: PMC4609015
DOI: 10.1093/bioinformatics/btu495

Francesco Sambo et al. Bioinformatics. 2014 Nov 1;30(21):3078-85. doi: 10.1093/bioinformatics/btu495. Epub 2014 Jul 26.

Affiliation

¹ Department of Information Engineering, University of Padova, via Gradenigo 6/a, 35131 Padova, Italy.

Abstract

Motivation: The increasing interest in rare genetic variants and epistatic genetic effects on complex phenotypic traits is currently pushing genome-wide association study design towards datasets of increasing size, both in the number of studied subjects and in the number of genotyped single nucleotide polymorphisms (SNPs). This, in turn, is leading to a compelling need for new methods for compression and fast retrieval of SNP data.

Results: We present a novel algorithm and file format for compressing and retrieving SNP data, specifically designed for large-scale association studies. Our algorithm is based on two main ideas: (i) compress linkage disequilibrium blocks in terms of differences with a reference SNP and (ii) compress reference SNPs exploiting information on their call rate and minor allele frequency. Tested on two SNP datasets and compared with several state-of-the-art software tools, our compression algorithm is shown to be competitive in terms of compression rate and to outperform all tools in terms of time to load compressed data.
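Idea (i) can be illustrated with a minimal Python sketch. It is not the authors' C++ implementation: the function names `diff_against_reference` and `apply_diff` are hypothetical, and the integer genotype encoding (0 = homozygous frequent, 1 = heterozygous, 2 = homozygous rare, 3 = missing) is an assumption made only for this example. The sketch shows that a SNP in strong linkage disequilibrium with a reference SNP can be stored as a short list of positions and values where the two differ.

```python
from typing import List, Tuple

# Hypothetical genotype encoding, assumed for this sketch only:
# 0 = homozygous frequent, 1 = heterozygous, 2 = homozygous rare, 3 = missing.
Genotypes = List[int]
Variation = Tuple[int, int]  # (subject index, genotype of the summarized SNP)


def diff_against_reference(snp: Genotypes, reference: Genotypes) -> List[Variation]:
    """Return the positions and values where `snp` differs from `reference`."""
    if len(snp) != len(reference):
        raise ValueError("both SNPs must cover the same subjects")
    return [(i, g) for i, (g, r) in enumerate(zip(snp, reference)) if g != r]


def apply_diff(reference: Genotypes, variations: List[Variation]) -> Genotypes:
    """Rebuild the summarized SNP from the reference SNP and its variations."""
    restored = list(reference)
    for index, genotype in variations:
        restored[index] = genotype
    return restored


reference_snp = [0, 0, 1, 2, 0, 1, 0, 0]
summarized_snp = [0, 0, 1, 2, 0, 0, 0, 3]
variations = diff_against_reference(summarized_snp, reference_snp)
print(variations)  # [(5, 0), (7, 3)]
```

Only two (index, value) pairs are needed here instead of eight genotypes; the saving grows with the number of subjects and with the strength of the correlation between the two SNPs.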

Availability and implementation: Our compression and decompression algorithms are implemented in a C++ library, are released under the GNU General Public License and are freely downloadable from http://www.dei.unipd.it/~sambofra/snpack.html.

Figures

**Fig. 1.**
Schematics of the five codes for compressing reference SNPs (a) and of the code for compressing summarized SNPs by storing their variations with respect to a reference SNP (b). For each code, we report the sequence of stored values and, above them, the number of allocated bits. (a) Code 1 contains the code ID (3 bits), followed by five 0 bits and the genotypes of the subjects (four subjects per byte). Codes 2–5 contain the code ID followed by *bdiff_xx*, i.e. the number of bits used to represent the first index and the differences between consecutive indices of the subjects in category xx, where xx can be aa, aA, AA or NA for homozygous rare, heterozygous, homozygous frequent or missing subjects, respectively. Codes 2–5 then contain the number of subjects with genotype xx (*n_xx*), followed by the first xx index and the differences between the indices of consecutive subjects in category xx. In addition, codes 3 and 5 contain binary_array_aA and binary_array_NA, respectively, each with 1 in position i if the i-th subject is in category aA (or NA) and 0 otherwise. (b) Code 6, used to compress summarized SNPs, contains the code ID (3 bits) followed by 21 bits coding the signed distance (upstream or downstream) from the reference SNP, the number of variations with respect to the reference SNP (*n_var*), and the positions (indices_var) and values (variations) of those variations
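Two building blocks named in this caption can be sketched in Python: packing four 2-bit genotypes per byte (the payload of code 1) and storing a sorted list of subject indices as a first index plus gaps, each on *bdiff_xx* bits (codes 2–5). This is an illustrative sketch, not the published file format: the function names, the genotype-to-integer mapping and the bit order inside each byte are assumptions, and the 3-bit code ID and header fields are omitted.

```python
from itertools import accumulate
from typing import List, Tuple


def pack_genotypes(genotypes: List[int]) -> bytes:
    """Pack 2-bit genotype values (0..3) four per byte.

    Assumed layout: the first subject occupies the two lowest bits.
    """
    packed = bytearray((len(genotypes) + 3) // 4)
    for i, g in enumerate(genotypes):
        if not 0 <= g <= 3:
            raise ValueError("genotype must fit in 2 bits")
        packed[i // 4] |= g << (2 * (i % 4))
    return bytes(packed)


def unpack_genotypes(packed: bytes, n_subjects: int) -> List[int]:
    """Invert pack_genotypes for the first n_subjects subjects."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(n_subjects)]


def index_gaps(indices: List[int]) -> Tuple[int, List[int]]:
    """Turn sorted subject indices into (bdiff, [first index, gap, gap, ...]).

    bdiff is the number of bits needed for the largest stored value.
    """
    if not indices:
        return 0, []
    values = [indices[0]] + [b - a for a, b in zip(indices, indices[1:])]
    bdiff = max(1, max(values).bit_length())
    return bdiff, values


def gaps_to_indices(values: List[int]) -> List[int]:
    """Recover the original indices by cumulative summation."""
    return list(accumulate(values))


bdiff, values = index_gaps([3, 10, 11, 40])
print(bdiff, values)  # 5 [3, 7, 1, 29]
```

Storing gaps pays off when a category is sparse, as for rare homozygous or missing genotypes: the gaps of a sparse index list need few bits each, whereas the dense 2-bit layout always costs two bits per subject.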

**Fig. 2.**
Median byte size, separately for each code, over 20 randomly sampled sets of 1000 SNPs from the WTCCC (left) and the 1000g (right) datasets. Whiskers extend from the first to the third quartile of the total byte size

**Fig. 3.**
Time in seconds for compression versus size in MB of the compressed files on disk for different values of the maximum neighbourhood size, for the WTCCC (a) and the 1000g (b) datasets

**Fig. 4.**
Time in seconds for compression, MB of the compressed files on disk and peak MB of RAM occupation during compression when splitting the computation for each chromosome in different numbers of chunks (between 1 and 32), for the WTCCC (a) and the 1000g (b) datasets

**Fig. 5.**
Histogram (log counts) of the neighbourhood extension of reference SNPs, for the WTCCC (top) and 1000g (bottom) datasets
