Enhancing Data Compression: Recent Innovations in LZ77 Algorithms
- PMID: 40444293
- PMCID: PMC12409268
- DOI: 10.1089/cmb.2024.0879
Enhancing Data Compression: Recent Innovations in LZ77 Algorithms
Abstract
The growing volume of genomic data, driven by advances in sequencing technologies, demands efficient data compression solutions. Traditional algorithms, such as Lempel-Ziv77 (LZ77), have been fundamental in offering lossless compression, yet they often fall short when applied to the highly repetitive structures typical of genomic sequences. This review explores the evolution of LZ77 and its adaptations for genomic data compression, highlighting specialized algorithms designed to handle redundancy in large-scale sequencing datasets efficiently. Innovations in this field have enhanced compression ratios and processing efficiencies leveraging intrinsic redundancy within genomic datasets. We critically examine a spectrum of LZ77-based algorithms, including newer adaptations for external and semi-external memory settings, and contrast their efficacy in managing large-scale genomic data. We conducted experiments to evaluate the performance of several algorithms, including KKP2, RLE-LZ, SE-KKP, BGone, and PFP-LZ77, on both real-world datasets from the Pizza&Chili repetitive corpus, Salmonella genomes, and human chromosome 19 genomes. These results underscore the trade-offs between time and memory consumption between algorithms. This article aims to provide a comprehensive guide on the current landscape and future directions of data compression technologies, equipping bioinformaticians and other practitioners with insight to tackle the escalating data challenges in genomics and beyond.
Keywords: Burrows–Wheeler transform; LZ77; data compression; data structures; suffix array.
References
-
- Bingmann T, Fischer J, Osipov V, et al. Inducing Suffix and LCP Arrays in External Memory. ACM J Exp Algorithmics 2016;21:1–27.
-
- Büren F, et al. Suffix Array Construction on Multi-GPU Systems. In: Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing. 2019; pp. 183–194.
-
- Burrows M, Wheeler DJ. A Block-Sorting Lossless Data Compression Algorithm. Technical Report 0769518966, Digital Equipment Corporation; 1994.
-
- Castelli R, et al. Lossless compression of nanopore sequencing raw signals. In: Proceedings of the Bioinformatics and Biomedical Engineering (IWBBIO). Springer Nature; 2024; pp. 130–141.
Publication types
MeSH terms
LinkOut - more resources
Full Text Sources
