AC2: An Efficient Protein Sequence Compression Tool Using Artificial Neural Networks and Cache-Hash Models
- PMID: 33925812
- PMCID: PMC8146440
- DOI: 10.3390/e23050530
Abstract
Recently, the scientific community has witnessed a substantial increase in the generation of protein sequence data, creating increasingly important challenges in efficient storage and data analysis. Data compression is a straightforward solution for both. However, few compressors in the literature are specific to protein sequences, and these specialized compressors improve the compression ratio only marginally over the best general-purpose compressors. In this paper, we present AC2, a new lossless data compressor for protein (or amino acid) sequences. AC2 uses a neural network to mix experts with a stacked generalization approach and applies individual cache-hash memory models to the highest context orders. Compared to the previous compressor (AC), we show gains of 2-9% and 6-7% in reference-free and reference-based modes, respectively. These gains come at the cost of computations three times slower. AC2 also improves on AC's memory usage, with requirements about seven times lower that are unaffected by the input sequence size. As an analysis application, we use AC2 to measure the similarity between each SARS-CoV-2 protein sequence and each viral protein sequence in the whole UniProt database. The results consistently show the highest similarity to the pangolin coronavirus, followed by the bat and human coronaviruses, contributing critical results to a currently controversial subject. AC2 is available for free download under the GPLv3 license.
Keywords: context mixing; coronavirus; lossless data compression; mixture of experts; neural networks; protein sequence compression.
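The abstract describes mixing several expert models into one prediction. As an illustration only, and not AC2's actual implementation, the sketch below mixes a few finite-context models of different orders over the 20-letter amino acid alphabet. Everything in it is an assumption made for the example: the names `ContextModel` and `estimate_bits`, the additive smoothing, the choice of orders, and above all the mixer, which is a single layer of softmax weights trained by gradient descent rather than the neural network with stacked generalization and cache-hash models that AC2 uses. The returned value is the ideal code length in bits that an arithmetic coder driven by the mixed probabilities would approach.

```python
import math
from collections import defaultdict

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


class ContextModel:
    """Order-k finite-context model; counts live in a hash table (dict)."""

    def __init__(self, order, alpha=1.0):
        self.order = order
        self.alpha = alpha  # additive smoothing, a hypothetical setting
        self.counts = defaultdict(lambda: defaultdict(int))

    def _context(self, history):
        # history[-0:] would return the whole string, so order 0 is special-cased
        return history[-self.order:] if self.order else ""

    def probs(self, history):
        seen = self.counts.get(self._context(history), {})
        total = sum(seen.values()) + self.alpha * len(ALPHABET)
        return [(seen.get(s, 0) + self.alpha) / total for s in ALPHABET]

    def update(self, history, symbol):
        self.counts[self._context(history)][symbol] += 1


def estimate_bits(sequence, orders=(1, 2, 3), lr=0.5):
    """Ideal code length (bits) of `sequence` under the mixed experts."""
    models = [ContextModel(k) for k in orders]
    weights = [0.0] * len(models)  # one mixing weight per expert
    max_order = max(orders)
    total_bits = 0.0
    for i, symbol in enumerate(sequence):
        history = sequence[max(0, i - max_order):i]
        idx = ALPHABET.index(symbol)
        expert_p = [m.probs(history)[idx] for m in models]
        # Softmax turns the weights into a convex combination of the experts
        z = [math.exp(w) for w in weights]
        z_sum = sum(z)
        share = [v / z_sum for v in z]
        p = sum(s * e for s, e in zip(share, expert_p))
        total_bits += -math.log2(p)
        # Gradient step on -ln(p): experts that beat the mixture gain weight
        weights = [w + lr * s * (e - p) / p
                   for w, s, e in zip(weights, share, expert_p)]
        for m in models:
            m.update(history, symbol)
    return total_bits


if __name__ == "__main__":
    seq = "ACDE" * 50
    print(f"{estimate_bits(seq) / len(seq):.3f} bits per symbol")
```

A highly repetitive input such as the one in the usage lines should cost well under the log2(20) ≈ 4.32 bits per symbol of a uniform model, because every expert quickly learns the repeating contexts.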
Conflict of interest statement
The authors declare that they have no competing interests.