Cooler: scalable storage for Hi-C data and other genomically labeled arrays

Nezar Abdennur¹, Leonid A Mirny¹,²

Affiliations

¹ Institute for Medical Engineering and Science, Cambridge, MA 02139, USA.
² Department of Physics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.

PMID: 31290943
PMCID: PMC8205516
DOI: 10.1093/bioinformatics/btz540

Nezar Abdennur et al. Bioinformatics. 2020 Jan 1;36(1):311-316. doi: 10.1093/bioinformatics/btz540.

Abstract

Motivation: Most existing coverage-based (epi)genomic datasets are one-dimensional, but newer technologies probing interactions (physical, genetic, etc.) produce quantitative maps with two-dimensional genomic coordinate systems. Storage and computational costs mount sharply with data resolution when such maps are stored in dense form. Hence, there is a pressing need to develop data storage strategies that handle the full range of useful resolutions in multidimensional genomic datasets by taking advantage of their sparse nature, while supporting efficient compression and providing fast random access to facilitate development of scalable algorithms for data analysis.

Results: We developed a file format called cooler, based on a sparse data model, that can support genomically labeled matrices at any resolution. It has the flexibility to accommodate various descriptions of the data axes (genomic coordinates, tracks and bin annotations), resolutions, data density patterns and metadata. Cooler is based on HDF5 and is supported by a Python library and command line suite to create, read, inspect and manipulate cooler data collections. The format has been adopted as a standard by the NIH 4D Nucleome Consortium.

Availability and implementation: Cooler is cross-platform, BSD-licensed and can be installed from the Python Package Index or the Bioconda repository. The source code is maintained on GitHub at https://github.com/mirnylab/cooler.

Supplementary information: Supplementary data are available at Bioinformatics online.
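
The Motivation's point that dense storage costs mount sharply with resolution can be made concrete with a short calculation. The sketch below is illustrative only: the genome length, the 4-byte value size and the 64-bit coordinate size are assumptions chosen for the arithmetic, not figures taken from the paper, and compression is ignored.

```python
# Back-of-the-envelope storage estimate for a genome-wide contact matrix.
# All constants are illustrative assumptions, not values from the paper.

GENOME_LENGTH_BP = 3_100_000_000  # assumed: roughly the size of the human genome
BYTES_PER_VALUE = 4               # assumed: 32-bit counts
BYTES_PER_COORD = 8               # assumed: 64-bit integer bin IDs


def dense_matrix_bytes(bin_size_bp: int) -> int:
    """Bytes needed to store every cell of the square bin-by-bin matrix."""
    n_bins = -(-GENOME_LENGTH_BP // bin_size_bp)  # ceiling division
    return n_bins * n_bins * BYTES_PER_VALUE


def sparse_matrix_bytes(n_nonzero: int) -> int:
    """Bytes for a coordinate list: two bin IDs plus one value per non-zero cell."""
    return n_nonzero * (2 * BYTES_PER_COORD + BYTES_PER_VALUE)


for bin_size in (1_000_000, 100_000, 10_000, 1_000):
    size_gb = dense_matrix_bytes(bin_size) / 1e9
    print(f"{bin_size:>9,} bp bins: {size_gb:,.2f} GB if stored densely")
```

Under these assumptions, every tenfold increase in resolution multiplies the dense size by one hundred, whereas the sparse size depends only on the number of non-zero elements.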

Figures

**Fig. 1.**
Data model for GLSAs and cooler format. (a) Diagram of the GLSA data model. A multidimensional genomically labeled array can be represented via a decomposition that distinguishes the attributes describing the genomic intervals (table labeled *bins*) that make up the coordinates of the array’s axes from the actual non-zero elements of the array (table labeled *elements*). The element table contains one or more numerical *value columns* and simple integer coordinates that reference rows of the bin table (depicted using arrows). The bin table’s records describe a sequence of ordered, non-overlapping genomic intervals, minimally described by the reference sequence (*chrom*) and the *start* and *end* positions. The *chrom* column is further encoded as an integer enumeration to reference a third table labeled *chroms*, which contains attributes describing the reference sequences themselves, such as their genomic lengths. (b) Any selection of rows of the element table can be annotated by joining with the appropriate columns of the bin table. (c) For symmetric matrices, such as Hi-C maps, only upper triangular pixels are stored to eliminate duplication. Right, a diagram of a cooler data collection’s hierarchical structure. The three tables are modeled as HDF5 groups (depicted as folders) while the table columns are stored as 1D arrays, which are chunked and compressed internally by HDF5. A reserved set of metadata HDF5 attributes are associated with the root group of the data collection, including a property indicating whether the matrix is to be interpreted as symmetric.
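
The three-table decomposition described in this caption can be sketched with plain Python lists. This is a toy illustration, not the on-disk format: the coordinates, counts and the column names `bin1_id`, `bin2_id` and `count` are made up for the example, and the helper functions `annotate` and `to_dense` are hypothetical.

```python
# Toy illustration of the chroms / bins / elements decomposition.
# All names and values are assumptions made for this example.

chroms = [{"name": "chr1", "length": 30}, {"name": "chr2", "length": 20}]

# Bin table: ordered, non-overlapping intervals. "chrom" is an integer
# index into the chroms table rather than a repeated string.
bins = [
    {"chrom": 0, "start": 0, "end": 10},
    {"chrom": 0, "start": 10, "end": 20},
    {"chrom": 0, "start": 20, "end": 30},
    {"chrom": 1, "start": 0, "end": 10},
    {"chrom": 1, "start": 10, "end": 20},
]

# Element table: only non-zero entries, and only the upper triangle
# (bin1_id <= bin2_id) because the matrix is treated as symmetric.
elements = [
    {"bin1_id": 0, "bin2_id": 0, "count": 5},
    {"bin1_id": 0, "bin2_id": 2, "count": 1},
    {"bin1_id": 1, "bin2_id": 3, "count": 2},
    {"bin1_id": 4, "bin2_id": 4, "count": 7},
]


def annotate(element):
    """Join one element row with the bin and chrom tables, as in panel (b)."""
    b1, b2 = bins[element["bin1_id"]], bins[element["bin2_id"]]
    return {
        "chrom1": chroms[b1["chrom"]]["name"], "start1": b1["start"], "end1": b1["end"],
        "chrom2": chroms[b2["chrom"]]["name"], "start2": b2["start"], "end2": b2["end"],
        "count": element["count"],
    }


def to_dense(rows, n_bins, symmetric=True):
    """Materialize a dense matrix, mirroring the upper triangle if symmetric."""
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for row in rows:
        i, j = row["bin1_id"], row["bin2_id"]
        matrix[i][j] = row["count"]
        if symmetric and i != j:
            matrix[j][i] = row["count"]
    return matrix


annotated = [annotate(e) for e in elements]
dense = to_dense(elements, n_bins=len(bins))
```

Keeping integer bin references in the element table means the genomic annotation is stored once per bin rather than once per non-zero element, and the mirroring step shows why storing only the upper triangle loses no information for a symmetric matrix.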

**Fig. 2.**
Cooler CLI and Python library. (a) Summary of the main categories of cooler commands available with the cooler Python package, illustrating the flow of data. The main operations include the ingestion of file or text streams to create new coolers, aggregation and coarsening of existing coolers to lower resolutions, merging of axis-compatible matrices, normalization of cooler matrices by iterative correction, utilities to serialize and stream out the data and metadata inside a cooler file and to process range queries, and a lightweight viewer to visually inspect a matrix. For example, one uses either the load command to ingest pre-aggregated data already in matrix form or the cload command to aggregate paired tag records into a matrix. The genomic bin segmentation defining the axes of the matrix must be supplied separately, as either a path to a BED file or a path to a chromosome sizes file along with a specified fixed bin size. (b) The cooler Python library provides a Cooler class that exposes data *range selectors* to facilitate data retrieval and analysis. The individual *chrom*, *bin* and *pixel* tables are accessible using 1D range selectors that accept column and row-range selections and yield pandas data frame output. A cooler’s matrix values are also exposed using a 2D range selector that processes rectangular range queries specified either by a pair of genomic coordinate intervals in UCSC-style notation (using the fetch method) or as integer matrix coordinates (using Python slice syntax). The retrieved 2D range data may be materialized as dense NumPy arrays, sparse matrices or data frames. For symmetric coolers, the file’s upper triangular data will be appropriately mirrored in the array and sparse matrix outputs.
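
The range selectors described in panel (b) might be used roughly as follows. This is a minimal sketch assuming the `cooler` package is installed and that a cooler file exists at a path of your choosing; the helper names `ucsc_region` and `fetch_examples` are hypothetical and not part of the library, and `balance=False` is chosen here only so that raw stored values are returned.

```python
def ucsc_region(chrom, start, end):
    """Build a UCSC-style region string such as 'chr1:0-1000000' (hypothetical helper)."""
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    return f"{chrom}:{start}-{end}"


def fetch_examples(path, chrom, start, end, n_rows=10):
    """Open a cooler and demonstrate the 1D and 2D range selectors.

    The import is done inside the function so the sketch can be read and
    loaded without the package; calling it requires cooler and a real file.
    """
    import cooler

    clr = cooler.Cooler(path)
    region = ucsc_region(chrom, start, end)

    return {
        # 1D selectors: slicing returns pandas data frames.
        "chroms": clr.chroms()[:],
        "bins": clr.bins()[:n_rows],
        "pixels": clr.pixels()[:n_rows],
        # 2D selector queried by genomic interval (dense NumPy array).
        "by_region": clr.matrix(balance=False).fetch(region),
        # 2D selector queried by integer matrix coordinates.
        "by_index": clr.matrix(balance=False)[0:n_rows, 0:n_rows],
        # Same integer query materialized as a sparse matrix.
        "sparse": clr.matrix(balance=False, sparse=True)[0:n_rows, 0:n_rows],
    }
```

For instance, `fetch_examples("example.cool", "chr1", 0, 1_000_000)` would return the tables and sub-matrices for a hypothetical file named `example.cool` that contains a chromosome called `chr1`.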
