2022 Mar 4;38(6):1497-1503.
doi: 10.1093/bioinformatics/btac010.

CRAM 3.1: advances in the CRAM file format

James K Bonfield. Bioinformatics.

Abstract

Motivation: CRAM has established itself as a high-compression alternative to the BAM file format for DNA sequencing data. We describe updates that further improve its compression on data from modern sequencing instruments.

Results: With Illumina data, CRAM 3.1 is 7-15% smaller than the equivalent CRAM 3.0 file, and 50-70% smaller than the corresponding BAM file. Long-read technology shows more modest compression due to the presence of high-entropy signals.

Availability and implementation: The CRAM 3.0 specification is freely available from https://samtools.github.io/hts-specs/CRAMv3.pdf. The CRAM 3.1 improvements are available in a separate open-source HTScodecs library from https://github.com/samtools/htscodecs, and have been incorporated into HTSlib.

Supplementary information: Supplementary data are available at Bioinformatics online.


Figures

Fig. 1. Logical layout of a CRAM file, showing containers and slices as rows, and data series as columns. Random access is possible on rows, with rapid filtering (discarding) of columns.
Fig. 2. An example FQZComp configuration describing how previous quality values, the position in the sequence, a running sum of the quality differences (delta) and a generic model selector can be combined with lookup tables to generate a context model.
Fig. 3. Benchmarks of aligned data formats using 12 threads. MPEG-G figures are taken from the Voges et al. paper, with ‘MPEG-G (est.)’ possibly using a slightly different input file (see text). Genozip 12 and Genozip 13 refer to versions 12.0.34 and 13.0.5, respectively, with the latter being released after the initial preprint publication and during manuscript review. Parts A, B and C show results for human sample NA12878 sequenced using Illumina HiSeq2000, NovaSeq and PacBio CLR, respectively.
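
The Fig. 2 caption describes a context built from previous quality values, sequence position, a running delta and a model selector, each passed through a lookup table. The following is a minimal Python sketch of that general idea only. The 16-bit context width, the bit positions, the table contents and the names `ContextConfig`, `make_context` and `contexts_for_read` are all assumptions made for illustration; they are not taken from the CRAM 3.1 specification or from HTScodecs.

```python
from dataclasses import dataclass, field
from typing import List

CTX_BITS = 16  # assumed context width for this sketch


@dataclass
class ContextConfig:
    # Hypothetical lookup tables: each maps a raw value to a small code.
    qtab: List[int] = field(default_factory=lambda: [min(q // 4, 15) for q in range(64)])    # 4 bits
    ptab: List[int] = field(default_factory=lambda: [min(p // 32, 7) for p in range(1024)])  # 3 bits
    dtab: List[int] = field(default_factory=lambda: [min(d // 8, 7) for d in range(256)])    # 3 bits
    qshift: int = 4   # bits reserved per previous quality value
    ploc: int = 8     # bit offset of the position code
    dloc: int = 11    # bit offset of the delta code
    sloc: int = 14    # bit offset of the 2-bit selector


def make_context(cfg: ContextConfig, prev_quals: List[int], pos: int,
                 delta: int, selector: int) -> int:
    """Pack the table-mapped inputs into one integer context."""
    ctx = 0
    # Two most recent qualities occupy bits 0-3 and 4-7.
    for i, q in enumerate(prev_quals[:2]):
        ctx |= cfg.qtab[min(q, 63)] << (i * cfg.qshift)
    ctx |= cfg.ptab[min(pos, 1023)] << cfg.ploc
    ctx |= cfg.dtab[min(delta, 255)] << cfg.dloc
    ctx |= (selector & 0x3) << cfg.sloc
    return ctx & ((1 << CTX_BITS) - 1)


def contexts_for_read(cfg: ContextConfig, quals: List[int], selector: int = 0) -> List[int]:
    """Return the context that would be used to model each quality in a read."""
    prev = [0, 0]   # most recent quality first
    delta = 0       # running sum of absolute quality differences
    out = []
    for pos, q in enumerate(quals):
        out.append(make_context(cfg, prev, pos, delta, selector))
        delta += abs(q - prev[0])
        prev = [q, prev[0]]
    return out
```

In a real context-modelling compressor, each context value would index a table of adaptive symbol frequencies that drives an entropy coder; that stage is omitted here. Coarser lookup tables trade prediction accuracy for fewer contexts, which is the tuning knob the caption alludes to.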

