Bioinformatics. 2024 May 2;40(5):btae323.
doi: 10.1093/bioinformatics/btae323.

PQSDC: a parallel lossless compressor for quality scores data via sequences partition and run-length prediction mapping

Hui Sun et al. Bioinformatics. 2024.

Abstract

Motivation: Quality scores data (QSD) account for 70% of compressed FastQ files produced by short- and long-read sequencing technologies. Designing effective QSD compressors that balance compression ratio, time cost, and memory consumption is essential in scenarios such as large-scale genomics data sharing and long-term data backup. This study presents PQSDC, a novel parallel lossless compression algorithm dedicated to QSD that meets these requirements. PQSDC is built on two core components: a parallel sequences-partition model, designed to reduce peak memory consumption and time cost during compression and decompression, and a parallel four-level run-length prediction mapping model, designed to improve compression ratio. PQSDC is also designed to be highly concurrent on multicore CPU clusters.
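The two ideas named above, partitioning the sequences so blocks can be handled concurrently and exploiting runs of repeated quality symbols, can be illustrated with a minimal Python sketch. This is not the PQSDC implementation: the function names are hypothetical, `zlib` stands in for the back-end compressor, and the four-level prediction mapping model is not described in the abstract and is not reproduced here. `run_lengths` only shows what a run is.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def run_lengths(quality: str) -> list[tuple[str, int]]:
    """Collapse a quality string into (symbol, run length) pairs."""
    runs: list[tuple[str, int]] = []
    for ch in quality:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((ch, 1))              # start a new run
    return runs


def partition(seqs: list[str], n_parts: int) -> list[list[str]]:
    """Split the sequences into at most n_parts contiguous blocks."""
    size = max(1, -(-len(seqs) // n_parts))   # ceiling division
    return [seqs[i:i + size] for i in range(0, len(seqs), size)]


def compress_block(block: list[str]) -> bytes:
    """Compress one block independently (zlib is an assumed stand-in)."""
    return zlib.compress("\n".join(block).encode("ascii"), 9)


def decompress_block(blob: bytes) -> list[str]:
    """Invert compress_block for a block of non-empty sequences."""
    text = zlib.decompress(blob).decode("ascii")
    return text.split("\n") if text else []


def compress_parallel(seqs: list[str], n_parts: int) -> list[bytes]:
    """Compress the blocks concurrently; map() preserves block order."""
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        return list(pool.map(compress_block, partition(seqs, n_parts)))


def decompress_parallel(blobs: list[bytes]) -> list[str]:
    """Decompress the blocks and concatenate them in the original order."""
    out: list[str] = []
    for blob in blobs:
        out.extend(decompress_block(blob))
    return out
```

Because each block is compressed on its own, a worker only needs its block in memory, which is one plausible reading of how partitioning could lower peak memory; the paper's actual mechanism may differ.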

Results: We evaluate PQSDC and four state-of-the-art compression algorithms on 27 real-world datasets comprising 61.857 billion QSD characters and 632.908 million QSD sequences. (1) For short reads, PQSDC improves on the baselines by up to 7.06% in average compression ratio and up to 8.01% in weighted average compression ratio. The maximum total time savings during compression and decompression are 79.96% and 84.56%, respectively; the maximum average memory savings are 68.34% and 77.63%, respectively. (2) For long reads, PQSDC improves on the baselines by up to 12.51% and 13.42% in average and weighted average compression ratio, respectively. The maximum total time savings during compression and decompression are 53.51% and 72.53%, respectively; the maximum average memory savings are 19.44% and 17.42%, respectively. (3) PQSDC ranks second in compression robustness among the tested algorithms, indicating that it is less affected by the probability distribution of the QSD collections. Overall, our work provides a promising solution for parallel QSD compression that balances storage cost, time consumption, and memory occupation well.

Availability and implementation: The proposed PQSDC compressor can be downloaded from https://github.com/fahaihi/PQSDC.

Conflict of interest statement

No potential conflict of interest was reported by the authors.

Figures

Figure 1.
The overall compression workflow of the proposed PQSDC compressor. Examples of parallel PSPM, PRPM, and ZPAQ can be found in Supplementary Figs S1–S3. The decompression pipeline reverses this procedure.
