BMC Bioinformatics. 2023 Mar 28;24(1):121.
doi: 10.1186/s12859-023-05240-0.

GVC: efficient random access compression for gene sequence variations

Yeremia Gunawan Adhisantoso et al. BMC Bioinformatics.

Abstract

Background: In recent years, advances in high-throughput sequencing technologies have enabled the use of genomic information in many fields, such as precision medicine, oncology, and food quality control. The amount of genomic data being generated is growing rapidly and is expected to soon surpass the amount of video data. The majority of sequencing experiments, such as genome-wide association studies, have the goal of identifying variations in the gene sequence to better understand phenotypic variations. We present a novel approach for compressing gene sequence variations with random access capability: the Genomic Variant Codec (GVC). We use techniques such as binarization, joint row- and column-wise sorting of blocks of variations, as well as the image compression standard JBIG for efficient entropy coding.

Results: Our results show that GVC provides the best trade-off between compression and random access compared to the state of the art: it reduces the genotype information size from 758 GiB down to 890 MiB on the publicly available 1000 Genomes Project (phase 3) data, which is 21% less than the state of the art in random-access capable methods.

Conclusions: By providing the best results in terms of combined random access and compression, GVC facilitates the efficient storage of large collections of gene sequence variations. In particular, the random access capability of GVC enables seamless remote data access and application integration. The software is open source and available at https://github.com/sXperfect/gvc/.

Keywords: Compression; Random access; VCF; Variants.
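
As a rough check on the sizes quoted in the Results, the short sketch below computes the compression ratio implied by 758 GiB and 890 MiB. The last line assumes that "21% less" means the GVC output is 0.79 times the size of the best prior random-access method; that reading, and all variable names, are assumptions made here and are not stated in the abstract.

```python
# Sizes quoted in the abstract for the 1000 Genomes Project (phase 3) genotypes.
MIB_PER_GIB = 1024
original_mib = 758 * MIB_PER_GIB   # 758 GiB expressed in MiB
compressed_mib = 890               # GVC output size in MiB

# Compression ratio: original size divided by compressed size.
ratio = original_mib / compressed_mib

# Assumption: "21% less" means compressed_mib = (1 - 0.21) * prior_mib.
implied_prior_mib = compressed_mib / (1 - 0.21)

print(f"compression ratio: about {ratio:.0f}:1")
print(f"implied prior size: about {implied_prior_mib:.0f} MiB")
```

Under that reading, the quoted figures correspond to a ratio of several hundred to one, and the competing random-access method would occupy somewhat more than 1 GiB.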

Conflict of interest statement

JV, CR, VT, YGA, and JO have filed the patent application DE102021100199A1, which covers parts of the methods presented in the manuscript.

Figures

Fig. 1
Block diagram of the proposed encoding process. The genotype matrix G is processed by a series of transformations: splitting, binarization, and optionally sorting. At the end of the process, entropy coding is applied
Fig. 2
Example for G=022/01/000. Bit plane binarization yields two bit planes representing the most significant bit and the least significant bit of A. Row binarization generates only three binary rows because the first row of A requires two bits and the second row of A requires only one bit
Fig. 3
Average compression ratio (averaged over all chromosomes) achieved by each GVC configuration. The colors indicate the employed binarization: orange—bit plane binarization (concatenated), blue—bit plane binarization, green—row binarization. The patterns indicate the employed sorting: no pattern—no sorting, vertical lines—column sorting, horizontal lines—row sorting, both horizontal and vertical lines—sorting in both directions
Fig. 4
Compression ratio achieved by each GVC configuration with different block sizes. For all configurations with the sorting enabled, increasing the block size increases the compression ratio
Fig. 5
Comparison of GVC to the state-of-the-art methods GTC [4], GTRAC [3], and GTShark [5] with respect to compressed size
Fig. 6
Comparison of random access time between GTC [4] and GVC with respect to the range size
Fig. 7
Comparison of random access time between GTC [4] and GVC with respect to the number of samples
Fig. 8
An example of a random access process on compressed genotypes where the number of alternate alleles is one and the blocks are transformed using bit plane binarization and sorted in the row direction. A user needs the genotypes of all samples on chromosome 2 at loci 1000 through 1100, represented by “chr2:1000-1100”. First, GVC finds the blocks containing the required genotype information using a block lookup process. The bitstreams of the selected block, in this case the block with ID 1, are then decoded, yielding the sort indices ã and the binary matrix B. Using the position information of each variant site and the sort indices ã, GVC selects the required rows or columns of the binary matrix B. Finally, the selected rows and columns of B are inverse transformed to return the genotypes of all samples at loci 1000 through 1100
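
The two binarizations described for Fig. 2 and the row-sorted random access described for Fig. 8 can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy matrices, the function names, the choice of rows as variant sites, and the plain lexicographic row ordering (the captions do not state which sorting criterion GVC uses). Entropy coding with JBIG is omitted.

```python
# Sketch only: toy data, function names, and the lexicographic sort criterion
# are illustrative assumptions and are not taken from the GVC software.

def bit_plane_binarize(matrix):
    """Split a matrix of small non-negative integers into binary planes, MSB first."""
    n_bits = max(1, max(max(row) for row in matrix).bit_length())
    return [[[(v >> b) & 1 for v in row] for row in matrix]
            for b in range(n_bits - 1, -1, -1)]

def row_binarize(matrix):
    """Binarize row by row, spending only as many bits as each row needs."""
    binary_rows = []
    for row in matrix:
        n_bits = max(1, max(row).bit_length())
        for b in range(n_bits - 1, -1, -1):
            binary_rows.append([(v >> b) & 1 for v in row])
    return binary_rows

def sort_rows(block):
    """Sort rows; sort_index[k] is the original row number of sorted row k."""
    sort_index = sorted(range(len(block)), key=lambda i: block[i])
    return [block[i] for i in sort_index], sort_index

def select_rows(sorted_block, sort_index, wanted):
    """Fetch the original rows listed in `wanted` without unsorting the whole block."""
    where = {original: k for k, original in enumerate(sort_index)}
    return [sorted_block[where[r]] for r in wanted]

# Hypothetical allele matrix: the first row needs two bits, the second needs one.
A = [[0, 2, 2],
     [0, 1, 0]]
planes = bit_plane_binarize(A)   # two planes, each with the shape of A
rows = row_binarize(A)           # three binary rows in total

# Hypothetical block with one alternate allele, so a single binary matrix B.
positions = [950, 1000, 1040, 1100, 1200]   # variant positions, one per row
B = [[1, 0, 1, 0],
     [0, 0, 1, 1],
     [1, 1, 0, 0],
     [0, 0, 0, 1],
     [1, 0, 1, 0]]
B_sorted, sort_index = sort_rows(B)

# Query 1000-1100: map positions to original rows, then read them via the sort index.
wanted = [i for i, p in enumerate(positions) if 1000 <= p <= 1100]
genotypes = select_rows(B_sorted, sort_index, wanted)
print(genotypes)
```

In this sketch only the requested rows are mapped back through the sort index, which mirrors the idea in the Fig. 8 caption that a query touches selected rows or columns of B and does not require restoring the full block.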

References

    1. Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, Efron MJ, Iyer R, Schatz MC, Sinha S, Robinson GE. Big data: astronomical or genomical? PLoS Biol. 2015;13(7):e1002195. doi: 10.1371/journal.pbio.1002195.
    2. Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter G, Marth GT, Sherry ST, et al. The variant call format and VCFtools. Bioinformatics. 2011;27(15):2156–2158. doi: 10.1093/bioinformatics/btr330.
    3. Tatwawadi K, Hernaez M, Ochoa I, Weissman T. GTRAC: fast retrieval from compressed collections of genomic variants. Bioinformatics. 2016;32(17):i479–i486. doi: 10.1093/bioinformatics/btw437.
    4. Danek A, Deorowicz S. GTC: how to maintain huge genotype collections in a compressed form. Bioinformatics. 2018;34(11):1834–1840. doi: 10.1093/bioinformatics/bty023.
    5. Deorowicz S, Danek A. GTShark: genotype compression in large projects. Bioinformatics. 2019;35(22):4791–4793. doi: 10.1093/bioinformatics/btz508.