A fast machine learning dataloader for epigenetic tracks from BigWig files

Joren Sebastian Retel et al.

Bioinformatics. 2024 Jan 2;40(1):btad767. doi: 10.1093/bioinformatics/btad767.

Abstract

Summary: We created bigwig-loader, a dataloader for epigenetic profiles from BigWig files that decompresses and processes data for multiple intervals from multiple BigWig files in parallel. This is the access pattern needed to create training batches for typical machine learning models on epigenetics data. Using a new codec, the decompression can be done on a graphics processing unit (GPU), making it fast enough to create the training batches on the fly during training and removing the need to save preprocessed training examples to disk.
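To make this access pattern concrete, the sketch below assembles one batch with pyBigWig, the CPU-bound approach that bigwig-loader parallelizes. The file paths and intervals are hypothetical placeholders, and the snippet is not the bigwig-loader API; its actual interface is documented in the repository linked below.

    import numpy as np
    import pyBigWig

    # Hypothetical inputs for illustration; a real training run would
    # sample many intervals per batch across many tracks.
    bigwig_paths = ["track_a.bigWig", "track_b.bigWig"]
    intervals = [("chr1", 1_000, 2_000), ("chr2", 5_000, 6_000)]

    tracks = [pyBigWig.open(path) for path in bigwig_paths]

    # One value per base pair: shape (n_intervals, n_tracks, interval_length).
    batch = np.stack([
        np.stack([bw.values(chrom, start, end, numpy=True) for bw in tracks])
        for chrom, start, end in intervals
    ])
    print(batch.shape)  # (2, 2, 1000)

    for bw in tracks:
        bw.close()

Every additional track and interval multiplies the work in this nested loop, which is why moving the decompression to the GPU pays off at machine-learning batch sizes.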

Availability and implementation: The bigwig-loader installation instructions and source code can be accessed at https://github.com/pfizer-opensource/bigwig-loader.


Figures

Figure 1.
(A) Overview of the dataloading process in the bigwig-loader library. A batch of fixed-length intervals is sampled from a general set of (larger) regions of interest; these can, for instance, be all regions used in the train, validation, or test split. The library also contains functionality to create such regions of interest based on a value threshold. For the sampled intervals, the relevant compressed chunks from all BigWig files are pulled from disk, decompressed, and converted to a value tensor. Additionally, genomic sequences are loaded and optionally one-hot encoded (see the sketch after this caption), so that both the input and target tensors are available for the typical supervised machine learning methods developed for this type of data. Machine learning models (bottom right) and the code to train them are not part of this library. (B) Comparison of the throughput of pyBigWig using multiple CPUs with that of bigwig-loader. The number of samples pyBigWig can load per second depends only on the number of CPU cores used, not on the batch size. Note also that the relationship between the number of CPUs and data throughput is not linear, because multiprocessing carries an overhead. When only a few samples are needed, pyBigWig is faster; when more than a few training examples are needed, as is the case for machine learning applications, bigwig-loader is the faster alternative.
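As a minimal illustration of the one-hot encoding step in panel (A), the sketch below maps a DNA sequence to the (length, 4) input tensor that typical supervised models consume. The helper function and the example sequence are assumptions made for this note, not part of the bigwig-loader API.

    import numpy as np

    BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

    def one_hot_encode(sequence: str) -> np.ndarray:
        """Encode a DNA sequence as a (length, 4) float32 matrix.
        Ambiguous bases such as 'N' are left as all-zero rows."""
        encoded = np.zeros((len(sequence), 4), dtype=np.float32)
        for i, base in enumerate(sequence.upper()):
            column = BASE_INDEX.get(base)
            if column is not None:
                encoded[i, column] = 1.0
        return encoded

    print(one_hot_encode("ACGTN"))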
