2020 Apr 1;36(7):2275-2277.
doi: 10.1093/bioinformatics/btz922.

GABAC: an arithmetic coding solution for genomic data

Jan Voges et al. Bioinformatics.

Abstract

Motivation: In an effort to provide a response to the ever-expanding generation of genomic data, the International Organization for Standardization (ISO) is designing a new solution for the representation, compression and management of genomic sequencing data: the Moving Picture Experts Group (MPEG)-G standard. This paper discusses the first implementation of an MPEG-G compliant entropy codec: GABAC. GABAC combines proven coding technologies, such as context-adaptive binary arithmetic coding, binarization schemes and transformations, into a straightforward solution for the compression of sequencing data.
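As a rough illustration of how binarization and context-adaptive modelling fit together, the sketch below binarizes integer symbols with a truncated unary code and sums the ideal code length an arithmetic coder would approach when each bin position has its own adaptive probability estimate. It is not GABAC's API and not the MPEG-G coding process: the names `truncated_unary`, `AdaptiveBitModel` and `estimate_coded_bits` are hypothetical, the choice of truncated unary and of the bin position as context is an assumption, and a simple count-based estimator stands in for the state machine a real CABAC engine uses.

```python
import math


def truncated_unary(value, max_value):
    """Binarize `value` as that many ones plus a terminating zero.

    The terminating zero is omitted when value == max_value.
    """
    if not 0 <= value <= max_value:
        raise ValueError("value out of range")
    bins = [1] * value
    if value < max_value:
        bins.append(0)
    return bins


class AdaptiveBitModel:
    """Count-based estimate of the bin probabilities for one context.

    Illustrative stand-in only; real CABAC engines use a finite-state
    probability estimator rather than raw counts.
    """

    def __init__(self):
        self.counts = [1, 1]  # smoothed counts of zeros and ones

    def cost(self, bit):
        """Ideal code length in bits for `bit` under the current estimate."""
        return -math.log2(self.counts[bit] / sum(self.counts))

    def update(self, bit):
        """Adapt the estimate after a bin has been coded."""
        self.counts[bit] += 1


def estimate_coded_bits(symbols, max_value):
    """Sum ideal code lengths, using the bin position as the context index."""
    contexts = [AdaptiveBitModel() for _ in range(max_value)]
    total = 0.0
    for symbol in symbols:
        for position, bit in enumerate(truncated_unary(symbol, max_value)):
            model = contexts[position]
            total += model.cost(bit)
            model.update(bit)
    return total
```

On a heavily skewed stream such as `estimate_coded_bits([0] * 1000, 3)`, the estimate comes to roughly 10 bits, against 2000 bits for a fixed two-bit code per symbol. The gain comes entirely from the per-context models adapting to the observed bin statistics.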

Results: We demonstrate that GABAC outperforms well-established (entropy) codecs in a significant set of cases and thus can serve as an extension for existing genomic compression solutions, such as CRAM.

Availability and implementation: The GABAC library is written in C++. We also provide a command line application which exercises all features provided by the library. GABAC can be downloaded from https://github.com/mitogen/gabac.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1. Rank of compression performance (left) and speed (right). Dots were jittered for clarity. The x-axes show the test set ID (01 and 02) from which the descriptor stream files were generated. The y-axes denote the actual ranks. Each dot depicts the rank a codec achieved on one specific descriptor stream file. The red lines denote the mean ranks, averaged over both test items. (Color version of this figure is available at Bioinformatics online.)
