Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2016 Sep 1;32(17):i479-i486.
doi: 10.1093/bioinformatics/btw437.

GTRAC: fast retrieval from compressed collections of genomic variants

Affiliations

GTRAC: fast retrieval from compressed collections of genomic variants

Kedar Tatwawadi et al. Bioinformatics. .

Abstract

Motivation: The dramatic decrease in the cost of sequencing has resulted in the generation of huge amounts of genomic data, as evidenced by projects such as the UK10K and the Million Veteran Project, with the number of sequenced genomes ranging in the order of 10 K to 1 M. Due to the large redundancies among genomic sequences of individuals from the same species, most of the medical research deals with the variants in the sequences as compared with a reference sequence, rather than with the complete genomic sequences. Consequently, millions of genomes represented as variants are stored in databases. These databases are constantly updated and queried to extract information such as the common variants among individuals or groups of individuals. Previous algorithms for compression of this type of databases lack efficient random access capabilities, rendering querying the database for particular variants and/or individuals extremely inefficient, to the point where compression is often relinquished altogether.

Results: We present a new algorithm for this task, called GTRAC, that achieves significant compression ratios while allowing fast random access over the compressed database. For example, GTRAC is able to compress a Homo sapiens dataset containing 1092 samples in 1.1 GB (compression ratio of 160), while allowing for decompression of specific samples in less than a second and decompression of specific variants in 17 ms. GTRAC uses and adapts techniques from information theory, such as a specialized Lempel-Ziv compressor, and tailored succinct data structures.

Availability and implementation: The GTRAC algorithm is available for download at: https://github.com/kedartatwawadi/GTRAC CONTACT: : kedart@stanford.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

PubMed Disclaimer

Figures

Fig. 1.
Fig. 1.
GTRAC allows for fast access of information of a specific variant or the genotype of a sample/group of samples over the compressed VCF file
Fig. 2.
Fig. 2.
Example of a binary matrix H representing the presence/absence of variants in the haplotypes. Note that in this representation, columns represent variant information across samples and rows represent haplotypes of a given sample
Fig. 3.
Fig. 3.
A simple example of the parsing process. The 3×20matrixH is first converted to the 3×10 symbol matrix H2 (K=2). Then the parsing is performed sequentially at each row (starting on the first row), from left to right
Fig. 4.
Fig. 4.
A simple example of rank and select usage corresponding to the phraseEnd vectors of example of Figure 3
Fig. 5.
Fig. 5.
The recursive algorithm for sub-string extraction from the compressed symbol matrix HK. The input parameters (vid,start,len) correspond to the row-id, the start position, and the length, respectively, of the sub-string we want to extract
Fig. 6.
Fig. 6.
The Recursive Algorithm for extraction of a specific variant with column-id cid, from all the rows

References

    1. Ball M.P. et al. (2012) A public resource facilitating clinical use of genomes. Proc. Natl. Acad. Sci. USA, 109, 11920–11927. - PMC - PubMed
    1. Consortium G.P. et al. (2012) An integrated map of genetic variation from 1,092 human genomes. Nature, 491, 56–65. - PMC - PubMed
    1. Danecek P. et al. (2011) The variant call format and vcftools. Bioinformatics, 27, 2156–2158. - PMC - PubMed
    1. Deorowicz S. et al. (2013) Genome compression: a novel approach for large collections. Bioinformatics, 29, 2572–2578. - PubMed
    1. Illumina I. (2015). TruGenome Clinical Sequencing Services. http://www.illumina.com/clinical/illumina_clinical_laboratory/trugenome-....