Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2022 Apr 18;17(4):e0265360.
doi: 10.1371/journal.pone.0265360. eCollection 2022.

CHAPAO: Likelihood and hierarchical reference-based representation of biomolecular sequences and applications to compressing multiple sequence alignments

Affiliations

CHAPAO: Likelihood and hierarchical reference-based representation of biomolecular sequences and applications to compressing multiple sequence alignments

Md Ashiqur Rahman et al. PLoS One. .

Abstract

Background: High-throughput experimental technologies are generating tremendous amounts of genomic data, offering valuable resources to answer important questions and extract biological insights. Storing this sheer amount of genomic data has become a major concern in bioinformatics. General purpose compression techniques (e.g. gzip, bzip2, 7-zip) are being widely used due to their pervasiveness and relatively good speed. However, they are not customized for genomic data and may fail to leverage special characteristics and redundancy of the biomolecular sequences.

Results: We present a new lossless compression method CHAPAO (COmpressing Alignments using Hierarchical and Probabilistic Approach), which is especially designed for multiple sequence alignments (MSAs) of biomolecular data and offers very good compression gain. We have introduced a novel hierarchical referencing technique to represent biomolecular sequences which combines likelihood based analyses of the sequence similarities and graph theoretic algorithms. We performed an extensive evaluation study using a collection of real biological data from the avian phylogenomics project, 1000 plants project (1KP), and 16S and 23S rRNA datasets. We report the performance of CHAPAO in comparison with general purpose compression techniques as well as with MFCompress and Nucleotide Archival Format (NAF)-two of the best known methods especially designed for FASTA files. Experimental results suggest that CHAPAO offers significant improvements in compression gain over most other alternative methods. CHAPAO is freely available as an open source software at https://github.com/ashiq24/CHAPAO.

Conclusion: CHAPAO advances the state-of-the-art in compression algorithms and represents a potential alternative to the general purpose compression techniques as well as to the existing specialized compression techniques for biomolecular sequences.

PubMed Disclaimer

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1
Fig 1. Character evolution and multiple sequence alignment.
(a) Two observed sequences, (b) Character evolution with substitution and indels which can change the sequence length and blur the homology, and (c) Multiple sequence alignment of the two sequences capturing the underlying character evolution where each site consists of homologous characters.
Fig 2
Fig 2. Overview of the compression and decompression techniques in CHAPAO.
A directed weighted “Encodability Graph” is constructed where each vertex corresponds to a sequence in MSA except for the dummy node (shown in red) which is used as a “source” vertex. Next, a minimum spanning arborescence MA in the graph is constructed. Sequences that are children of the dummy node in the MA will be used as reference sequences. Appropriate metadata are generated to hierarchically represent all other sequences. Finally, the reference sequences along with the metadata are compressed using existing compression techniques. The pipeline is completely reversible, allowing lossless decompression of the original MSAs.
Fig 3
Fig 3. Directed graph based modeling.
(a) A multiple sequence alignment with three sequences, (b) the corresponding cost matrix, (c) the encodability graph EG, and (d) the corresponding minimum spanning arborescence MA.
Fig 4
Fig 4. Performance of various compression techniques on avian datasets.
To better understand the relative performance of different methods across different file sizes, we distribute the MSA files into various bins based on their sizes. For each bin (file-size range), we show the average size of the compressed files produced by various methods. (a) UCEs. (b) Introns. (c) Exons.
Fig 5
Fig 5. Performance of various compression techniques on 10 concatenated alignments in avian dataset.
Fig 6
Fig 6. Performance of CHAPAO with varying levels of dissimilarity/divergence of the 14,490 MSAs in avian datasets.
The average hamming distance (as defined in Eq 4) of these files ranges from 0–13. The box plots show the compression ratio (ratio of the size of the original file and the compressed file) of CHAPAO on the MSAs in avian datasets (here MSAs are sorted in an ascending order of their average hamming distance).
Fig 7
Fig 7. Comparison of various compression techniques on 16S and 23S datasets and 1KP dataset.
(a) 16S. (b) 23S. (c) 1KP.
Fig 8
Fig 8. Impact of the lengths of sliding window and overlap on compression ratio.
We show the performance of various variants of CHAPAO on 16S and 23S datasets. (a) 16S. (b) 23S.

References

    1. Loh PR, Baym M, Berger B. Compressive genomics. Nature Biotechnology. 2012;30(7):627. doi: 10.1038/nbt.2241 - DOI - PubMed
    1. Berger B, Peng J, Singh M. Computational solutions for omics data. Nature Reviews Genetics. 2013;14(5):333. doi: 10.1038/nrg3433 - DOI - PMC - PubMed
    1. Deutsch P. RFC 1952: GZIP file format specification version 4.3. Internet Engineering Task Force. 1996;.
    1. Burrows M, Wheeler DJ. A block-sorting lossless data compression algorithm. 1994;.
    1. Pavlov I. 7zip file archive application. Available from: https://www.7-zip.org.