Bioinformatics. 2020 Jan 1;36(1):311-316.
doi: 10.1093/bioinformatics/btz540.

Cooler: scalable storage for Hi-C data and other genomically labeled arrays

Nezar Abdennur et al. Bioinformatics.

Abstract

Motivation: Most existing coverage-based (epi)genomic datasets are one-dimensional, but newer technologies probing interactions (physical, genetic, etc.) produce quantitative maps with two-dimensional genomic coordinate systems. Storage and computational costs mount sharply with data resolution when such maps are stored in dense form. Hence, there is a pressing need to develop data storage strategies that handle the full range of useful resolutions in multidimensional genomic datasets by taking advantage of their sparse nature, while supporting efficient compression and providing fast random access to facilitate development of scalable algorithms for data analysis.

Results: We developed a file format called cooler, based on a sparse data model, that can support genomically labeled matrices at any resolution. It has the flexibility to accommodate various descriptions of the data axes (genomic coordinates, tracks and bin annotations), resolutions, data density patterns and metadata. Cooler is based on HDF5 and is supported by a Python library and command line suite to create, read, inspect and manipulate cooler data collections. The format has been adopted as a standard by the NIH 4D Nucleome Consortium.

Availability and implementation: Cooler is cross-platform, BSD-licensed and can be installed from the Python Package Index or the bioconda repository. The source code is maintained on GitHub at https://github.com/mirnylab/cooler.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
Data model for GLSAs and cooler format. (a) Diagram of the GLSA data model. A multidimensional genomically labeled array can be represented via a decomposition that distinguishes the attributes describing the genomic intervals (table labeled bins) that make up the coordinates of the array’s axes from the actual non-zero elements of the array (table labeled elements). The element table contains one or more numerical value columns and simple integer coordinates that reference rows of the bin table (depicted using arrows). The bin table’s records describe a sequence of ordered, non-overlapping genomic intervals, minimally described by the reference sequence (chrom) and the start and end positions. The chrom column is further encoded as an integer enumeration to reference a third table labeled chroms, which contains attributes describing the reference sequences themselves, such as their genomic lengths. (b) Any selection of rows of the element table can be annotated by joining with the appropriate columns of the bin table. (c) For symmetric matrices, such as Hi-C maps, only upper triangular pixels are stored to eliminate duplication. Right, a diagram of a cooler data collection’s hierarchical structure. The three tables are modeled as HDF5 groups (depicted as folders) while the table columns are stored as 1D arrays, which are chunked and compressed internally by HDF5. A reserved set of metadata HDF5 attributes is associated with the root group of the data collection, including a property indicating whether the matrix is to be interpreted as symmetric.
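The three-table decomposition described in the Fig. 1 caption can be illustrated with a minimal sketch in plain Python. This is not the cooler library or its on-disk layout: the column names bin1_id, bin2_id and count, the helper names annotate and to_dense, and all values are assumptions chosen for illustration only.

```python
# Illustrative stand-ins for the three tables; all values are made up.

# chroms table: attributes of the reference sequences.
chroms = {"name": ["chr1", "chr2"], "length": [250, 120]}

# bins table: ordered, non-overlapping intervals. The chrom column is an
# integer enumeration that references rows of the chroms table.
bins = {
    "chrom": [0, 0, 0, 1, 1],
    "start": [0, 100, 200, 0, 100],
    "end": [100, 200, 250, 100, 120],
}

# Element (pixel) table: non-zero, upper triangular elements only.
# bin1_id and bin2_id are integer coordinates referencing rows of bins.
pixels = {
    "bin1_id": [0, 0, 1, 3],
    "bin2_id": [0, 2, 1, 4],
    "count": [5, 2, 7, 3],
}


def annotate(pixels, bins, chroms, rows):
    """Join selected element rows with the bin table (panel b)."""
    out = []
    for r in rows:
        b1, b2 = pixels["bin1_id"][r], pixels["bin2_id"][r]
        out.append({
            "chrom1": chroms["name"][bins["chrom"][b1]],
            "start1": bins["start"][b1],
            "end1": bins["end"][b1],
            "chrom2": chroms["name"][bins["chrom"][b2]],
            "start2": bins["start"][b2],
            "end2": bins["end"][b2],
            "count": pixels["count"][r],
        })
    return out


def to_dense(pixels, n_bins, symmetric=True):
    """Materialize a dense matrix, mirroring the stored upper triangle
    when the matrix is to be interpreted as symmetric (panel c)."""
    mat = [[0] * n_bins for _ in range(n_bins)]
    triples = zip(pixels["bin1_id"], pixels["bin2_id"], pixels["count"])
    for i, j, value in triples:
        mat[i][j] = value
        if symmetric and i != j:
            mat[j][i] = value
    return mat


annotated = annotate(pixels, bins, chroms, rows=[1])
dense = to_dense(pixels, n_bins=len(bins["start"]))
```

Storing each column as its own 1D sequence, as the dictionaries of lists do here, loosely mirrors the columnar layout the caption describes, where each table column is a separate chunked and compressed array.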
Fig. 2.
Cooler CLI and Python library. (a) Summary of the main categories of cooler commands available with the cooler Python package, illustrating the flow of data. The main operations include the ingestion of file or text streams to create new coolers; aggregation and coarsening of existing coolers to lower resolutions; merging of axis-compatible matrices; normalization of cooler matrices by iterative correction; utilities to serialize and stream out the data and metadata inside a cooler file and to process range queries; and a lightweight viewer to visually inspect a matrix. For example, one uses either the load command to ingest pre-aggregated data already in matrix form or the cload command to aggregate paired tag records into a matrix. The genomic bin segmentation defining the axes of the matrix must be supplied separately, either as a path to a BED file or as a path to a chromosome sizes file along with a specified fixed bin size. (b) The cooler Python library provides a Cooler class that exposes data range selectors to facilitate data retrieval and analysis. The individual chrom, bin and pixel tables are accessible using 1D range selectors that accept column and row-range selections and yield pandas data frame output. A cooler’s matrix values are also exposed using a 2D range selector that processes rectangular range queries specified either by a pair of genomic coordinate intervals in UCSC-style notation (using the fetch method) or as integer matrix coordinates (using Python slice syntax). The retrieved 2D range data may be materialized as dense NumPy arrays, sparse matrices or data frames. For symmetric coolers, the file’s upper triangular data will be appropriately mirrored in the array and sparse matrix outputs.
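The ideas in the Fig. 2 caption (a fixed-size bin segmentation derived from chromosome sizes, aggregation of paired records into upper triangular pixels, and a rectangular query given as UCSC-style regions) can be sketched conceptually in plain Python. This is not how the cooler package is implemented and does not use its API: every function name below (make_bins, aggregate_pairs, parse_region, fetch_dense), the input layout and the toy data are hypothetical.

```python
import re
from collections import Counter


def make_bins(chromsizes, binsize):
    """Fixed-size bins per chromosome; the last bin is truncated."""
    return [
        (chrom, start, min(start + binsize, length))
        for chrom, length in chromsizes.items()
        for start in range(0, length, binsize)
    ]


def chrom_offsets(chromsizes, binsize):
    """Index of the first bin of each chromosome in the bin table."""
    offsets, total = {}, 0
    for chrom, length in chromsizes.items():
        offsets[chrom] = total
        total += -(-length // binsize)  # ceiling division
    return offsets


def aggregate_pairs(pairs, chromsizes, binsize):
    """Count paired records per (bin1, bin2), upper triangle only."""
    offsets = chrom_offsets(chromsizes, binsize)
    counts = Counter()
    for chrom1, pos1, chrom2, pos2 in pairs:
        i = offsets[chrom1] + pos1 // binsize
        j = offsets[chrom2] + pos2 // binsize
        counts[(min(i, j), max(i, j))] += 1
    return dict(sorted(counts.items()))


def parse_region(region):
    """Parse a UCSC-style string such as 'chr1:1,000-2,000'."""
    match = re.fullmatch(r"(\w+):([\d,]+)-([\d,]+)", region)
    if match is None:
        raise ValueError(f"cannot parse region: {region!r}")
    chrom, start, end = match.groups()
    return chrom, int(start.replace(",", "")), int(end.replace(",", ""))


def bin_extent(region, chromsizes, binsize):
    """Half-open range of bin indices overlapping the region."""
    chrom, start, end = parse_region(region)
    offset = chrom_offsets(chromsizes, binsize)[chrom]
    return offset + start // binsize, offset + -(-end // binsize)


def fetch_dense(pixels, region1, region2, chromsizes, binsize):
    """Dense sub-matrix for region1 x region2 of a symmetric map,
    mirroring the stored upper triangular pixels as needed."""
    lo1, hi1 = bin_extent(region1, chromsizes, binsize)
    lo2, hi2 = bin_extent(region2, chromsizes, binsize)
    mat = [[0] * (hi2 - lo2) for _ in range(hi1 - lo1)]
    for (i, j), value in pixels.items():
        for a, b in {(i, j), (j, i)}:  # the pixel and its mirror image
            if lo1 <= a < hi1 and lo2 <= b < hi2:
                mat[a - lo1][b - lo2] = value
    return mat


# Toy inputs (made up): chromosome sizes and paired positions.
chromsizes = {"chr1": 250, "chr2": 120}
pairs = [
    ("chr1", 10, "chr1", 220),
    ("chr1", 230, "chr1", 40),
    ("chr1", 150, "chr1", 160),
    ("chr2", 5, "chr1", 120),
]
bins = make_bins(chromsizes, binsize=100)
pixels = aggregate_pairs(pairs, chromsizes, binsize=100)
cis = fetch_dense(pixels, "chr1:0-250", "chr1:0-250", chromsizes, 100)
trans = fetch_dense(pixels, "chr2:0-120", "chr1:100-200", chromsizes, 100)
```

The linear scan over all pixels in fetch_dense is only for clarity; a real implementation would need an index or sorted layout to answer range queries without reading every element.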
