Bioinformatics. 2014 Nov 1;30(21):3078-85.
doi: 10.1093/bioinformatics/btu495. Epub 2014 Jul 26.

Compression and fast retrieval of SNP data

Francesco Sambo et al. Bioinformatics.

Abstract

Motivation: The increasing interest in rare genetic variants and epistatic genetic effects on complex phenotypic traits is currently pushing genome-wide association study design towards datasets of increasing size, both in the number of studied subjects and in the number of genotyped single nucleotide polymorphisms (SNPs). This, in turn, is leading to a compelling need for new methods for compression and fast retrieval of SNP data.

Results: We present a novel algorithm and file format for compressing and retrieving SNP data, specifically designed for large-scale association studies. Our algorithm is based on two main ideas: (i) compress linkage disequilibrium blocks in terms of differences with a reference SNP and (ii) compress reference SNPs exploiting information on their call rate and minor allele frequency. Tested on two SNP datasets and compared with several state-of-the-art software tools, our compression algorithm is shown to be competitive in terms of compression rate and to outperform all tools in terms of time to load compressed data.
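As a rough illustration of idea (i), the sketch below stores a SNP as the list of positions where it differs from a reference SNP, then restores it. It is not the SNPack implementation: the function names are hypothetical, and the integer genotype coding (0 = aa, 1 = aA, 2 = AA, 3 = NA) is an assumption made only for this example.

```python
from typing import List, Tuple

# Assumed genotype coding for this sketch only: 0 = aa, 1 = aA, 2 = AA, 3 = NA.


def diff_against_reference(reference: List[int], snp: List[int]) -> List[Tuple[int, int]]:
    """Return (subject index, genotype) pairs where snp differs from reference."""
    if len(reference) != len(snp):
        raise ValueError("both SNPs must cover the same subjects")
    return [(i, g) for i, (r, g) in enumerate(zip(reference, snp)) if r != g]


def apply_variations(reference: List[int], variations: List[Tuple[int, int]]) -> List[int]:
    """Rebuild a summarized SNP from the reference SNP and its variations."""
    restored = list(reference)
    for index, genotype in variations:
        restored[index] = genotype
    return restored


# Two SNPs in strong linkage disequilibrium differ in only a few subjects.
reference = [0, 1, 2, 2, 1, 0, 2, 2]
neighbour = [0, 1, 2, 1, 1, 0, 2, 3]

variations = diff_against_reference(reference, neighbour)
restored = apply_variations(reference, variations)
```

When two SNPs are highly correlated, the list of variations is much shorter than the genotype vector itself, which is where the saving would come from in this simplified picture.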

Availability and implementation: Our compression and decompression algorithms are implemented in a C++ library, are released under the GNU General Public License and are freely downloadable from http://www.dei.unipd.it/~sambofra/snpack.html.


Figures

Fig. 1.
Schematics of the five codes for compressing reference SNPs (a), plus the code for compressing summarized SNPs by storing their variations with respect to a reference SNP (b). For each code, we report the sequence of stored values and, above them, the number of allocated bits. (a) Code 1 contains the code ID (3 bits) followed by five 0 bits and the genotypes of the subjects (four subjects per byte). The remaining codes report the code ID followed by bdiff_xx, i.e. the number of bits used to represent the first index and the differences between consecutive indices of the subjects in category xx, where xx can be aa, aA, AA or NA for homozygous rare, heterozygous, homozygous frequent or missing subjects, respectively. Codes 2–5 then contain the number of subjects with genotype xx (n_xx), followed by the first xx index and the differences between the indices of consecutive subjects in category xx. In addition, codes 3 and 5 contain binary_array_aA and binary_array_NA, respectively, which hold 1 in position i if the i-th subject is in category aA (or NA) and 0 otherwise. (b) Code 6, used to compress summarized SNPs, contains the code ID (3 bits) followed by 21 bits coding the signed distance (upstream or downstream) from the reference SNP, the number of variations with respect to the reference SNP (n_var), and the positions (indices_var) and values (variations) of those variations.
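The layouts in the caption can be sketched in a few lines. The example below packs genotypes in the style of Code 1 (a header byte holding a 3-bit code ID and five 0 bits, then four subjects per byte) and computes the first-index-plus-differences sequence used by codes 2–5. Several details are assumptions rather than facts from the paper: the numeric code ID (1), the most-significant-bit-first ordering within each byte, the 2-bit genotype values, and all function names.

```python
from typing import List, Tuple


def pack_code1(genotypes: List[int], code_id: int = 1) -> bytes:
    """Pack 2-bit genotypes four per byte after a one-byte header.

    Header: code ID in the top 3 bits, five 0 bits below (assumed bit order).
    """
    if any(g < 0 or g > 3 for g in genotypes):
        raise ValueError("genotypes must fit in 2 bits")
    out = bytearray([(code_id & 0b111) << 5])
    for start in range(0, len(genotypes), 4):
        byte = 0
        for offset, g in enumerate(genotypes[start:start + 4]):
            byte |= g << (6 - 2 * offset)  # first subject in the top 2 bits
        out.append(byte)
    return bytes(out)


def unpack_code1(data: bytes, n_subjects: int) -> List[int]:
    """Invert pack_code1; n_subjects trims padding in the last byte."""
    genotypes = []
    for byte in data[1:]:  # skip the header byte
        for offset in range(4):
            genotypes.append((byte >> (6 - 2 * offset)) & 0b11)
    return genotypes[:n_subjects]


def delta_encode_indices(indices: List[int]) -> Tuple[int, List[int]]:
    """Return (bdiff, values): values are the first index followed by the
    differences between consecutive indices; bdiff is the number of bits
    needed to represent the largest of those values."""
    if not indices:
        return 0, []
    values = [indices[0]] + [b - a for a, b in zip(indices, indices[1:])]
    return max(values).bit_length(), values


packed = pack_code1([0, 1, 2, 3, 2, 2])
bdiff, values = delta_encode_indices([3, 10, 11, 40])
```

Storing differences rather than absolute indices keeps the stored values small when subjects of a rare category are the only ones listed, so fewer bits per value are needed than for raw subject indices.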
Fig. 2.
Median byte size, separately for each code, over 20 randomly sampled sets of 1000 SNPs from the WTCCC (left) and the 1000g (right) datasets. Whiskers extend from the first to the third quartile of the total byte size
Fig. 3.
Time in seconds for compression versus size in MB of the compressed files on disk for different values of the maximum neighbourhood size, for the WTCCC (a) and the 1000g (b) datasets
Fig. 4.
Time in seconds for compression, MB of the compressed files on disk and peak MB of RAM occupation during compression when splitting the computation for each chromosome in different numbers of chunks (between 1 and 32), for the WTCCC (a) and the 1000g (b) datasets
Fig. 5.
Histogram (log counts) of the neighbourhood extension of reference SNPs, for the WTCCC (top) and 1000g (bottom) datasets
