AC2: An Efficient Protein Sequence Compression Tool Using Artificial Neural Networks and Cache-Hash Models

Milton Silva^{1,2}, Diogo Pratas^{1,2,3}, Armando J Pinho^{1,2}

Affiliations

¹ IEETA-Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, 3810-193 Aveiro, Portugal.
² Department of Electronics Telecommunications and Informatics, University of Aveiro, 3810-193 Aveiro, Portugal.
³ Department of Virology, University of Helsinki, 00014 Helsinki, Finland.

PMID: 33925812
PMCID: PMC8146440
DOI: 10.3390/e23050530


Milton Silva et al. Entropy (Basel). 2021 Apr 26;23(5):530. doi: 10.3390/e23050530.

Abstract

The volume of protein sequence data has grown substantially in recent years, making efficient storage and improved data analysis increasingly important challenges. Data compression addresses both. However, few compressors in the literature are designed specifically for protein sequences, and those that exist improve the compression ratio only marginally over the best general-purpose compressors. In this paper, we present AC2, a new lossless data compressor for protein (or amino acid) sequences. AC2 uses a neural network to mix experts with a stacked generalization approach, and applies individual cache-hash memory models to the highest context orders. Compared to the previous compressor (AC), AC2 achieves gains of 2-9% in reference-free mode and 6-7% in reference-based mode, at the cost of computations that are three times slower. AC2 also requires about seven times less memory than AC, and its memory usage is unaffected by the size of the input sequences. As an analysis application, we use AC2 to measure the similarity between each SARS-CoV-2 protein sequence and each viral protein sequence in the whole UniProt database. The results consistently show the highest similarity to the pangolin coronavirus, followed by the bat and human coronaviruses, which is relevant to a currently controversial subject. AC2 is available for free download under the GPLv3 license.

Keywords: context mixing; coronavirus; lossless data compression; mixture of experts; neural networks; protein sequence compression.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Figure 1**
Mixer architecture: high level overview of inputs to the neural network (mixer) used in AC2. *Model*₁ through *Model_n* represent the AC model outputs (probabilities for all the amino acids). *EMA* represents the exponential moving average for each symbol. *Freqs* are the frequencies for the last 8, 16, and 64 symbols. The network outputs represent the probabilities for each amino acid symbol.
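
The derived inputs named in this caption (a per-symbol exponential moving average and symbol frequencies over the last 8, 16, and 64 symbols) can be illustrated with a minimal sketch. The caption does not give exact definitions, so everything below is an assumption made for illustration: the function name `derived_inputs`, the smoothing factor `alpha`, the uniform EMA initialization, and the restriction to the 20 standard amino acids are hypothetical and are not taken from the AC2 source code.

```python
# Illustrative sketch only; not the AC2 implementation.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # assumption: 20 standard amino acids


def derived_inputs(history, alpha=0.1, windows=(8, 16, 64)):
    """Compute hypothetical mixer inputs from the symbols seen so far.

    Returns (ema, freqs) where:
      ema[s]      is an exponential moving average of "symbol s occurred",
      freqs[w][s] is the relative frequency of s among the last w symbols.
    Assumes history contains only symbols from ALPHABET.
    """
    # Start from a uniform distribution (assumed initialization).
    ema = {s: 1.0 / len(ALPHABET) for s in ALPHABET}
    for sym in history:
        for s in ALPHABET:
            hit = 1.0 if s == sym else 0.0
            # Move each average a fraction alpha toward the new observation.
            ema[s] += alpha * (hit - ema[s])

    freqs = {}
    for w in windows:
        tail = history[-w:]          # fewer than w symbols early on
        n = max(len(tail), 1)        # avoid division by zero on empty input
        freqs[w] = {s: tail.count(s) / n for s in ALPHABET}
    return ema, freqs


if __name__ == "__main__":
    ema, freqs = derived_inputs("MKTAYIAKQRQISFVKSHFSRQ")
    print(round(ema["Q"], 4), freqs[8]["Q"])
```

In a mixer of this kind, such values would be concatenated with the model probabilities to form the network input vector; the short windows react quickly to local composition changes while the longer ones are more stable.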

**Figure 2**
Smoothed gain of AC2 relative to AC, in bits per symbol (bps). Regions with the line above zero indicate that AC2 has better compression than AC. Three profiles are depicted for three species, namely (a) XV: *Xanthomonas* virus Xp10, (b) FV: *Fowlpox* virus, (c) HS: *Homo sapiens*.

**Figure 3**
Smoothed gain of AC2 relative to AC, in bits per symbol (bps). Regions with the line above zero indicate that AC2 has better compression than AC. Three profiles are depicted for referential compression of three sequence pairs, namely (a) Chromosome 1 of chimpanzee, (b) Chromosome 17 of gorilla, and (c) Mitochondrion of orangutan. All target sequences use the corresponding human sequence as reference. The compression parameters are the same as in Table 2.

**Figure 4**
The most similar protein sequences in the NCBI database for multiple SARS-CoV-2 protein sequences. The similarity metric is the Normalized Compression Distance (NCD); the lower the NCD, the higher the similarity. Five protein sequences are used for comparison: (a) membrane, (b) nucleoprotein, (c) envelope, (e) replicase polyprotein (ORF 1ab), and (f) spike. Panel (d) illustrates the two-dimensional localization of the proteins in SARS-CoV-2, while panel (g) shows the one-dimensional localization of the sequences that correspond to the proteins.
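
For readers unfamiliar with the metric, the NCD of two sequences x and y is commonly defined as (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed size and xy is the concatenation. The sketch below illustrates that formula only. It uses Python's `zlib` as a stand-in compressor, which is an assumption made so the example is runnable; the paper computes the measure with AC2, and the toy sequences shown are hypothetical.

```python
# Illustrative sketch of the NCD formula; zlib stands in for AC2.
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of data after zlib compression at the maximum level."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte sequences.

    Values near 0 suggest high similarity; values near 1 suggest little
    shared information, as far as the chosen compressor can detect.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    # Hypothetical toy sequences, not real viral proteins.
    query = b"MKTAYIAKQRQISFVKSHFSRQ" * 20
    close = b"MKTAYIAKQRQISFVKSHFSRQ" * 19 + b"MKTAYIAKQRQLSFVKSHFSRQ"
    print(ncd(query, close))
```

Ranking database entries by ascending NCD against a query sequence then yields a "most similar" list of the kind plotted in the figure. A general-purpose compressor such as zlib captures far less protein-specific structure than a dedicated tool, so absolute values from this sketch are not comparable to those in the paper.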

Cited by

Diagnosis of Inflammatory Bowel Disease and Colorectal Cancer through Multi-View Stacked Generalization Applied on Gut Microbiome Data.
Imangaliyev S, Schlötterer J, Meyer F, Seifert C. Diagnostics (Basel). 2022 Oct 17;12(10):2514. doi: 10.3390/diagnostics12102514. PMID: 36292203.
AlcoR: alignment-free simulation, mapping, and visualization of low-complexity regions in biological data.
Silva JM, Qi W, Pinho AJ, Pratas D. Gigascience. 2022 Dec 28;12:giad101. doi: 10.1093/gigascience/giad101. Epub 2023 Dec 13. PMID: 38091509.
Bioinformatics tools for the sequence complexity estimates.
Orlov YL, Orlova NG. Biophys Rev. 2023 Sep 15;15(5):1367-1378. doi: 10.1007/s12551-023-01140-y. eCollection 2023 Oct. PMID: 37974990. Review.
