Entropy (Basel). 2021 Apr 26;23(5):530.
doi: 10.3390/e23050530.

AC2: An Efficient Protein Sequence Compression Tool Using Artificial Neural Networks and Cache-Hash Models

Milton Silva et al. Entropy (Basel).

Abstract

Recently, the scientific community has witnessed a substantial increase in the generation of protein sequence data, raising challenges of increasing importance, namely efficient storage and improved data analysis. For both applications, data compression is a straightforward solution. However, the literature contains relatively few compressors designed specifically for protein sequences. Moreover, these specialized compressors only marginally improve the compression ratio over the best general-purpose compressors. In this paper, we present AC2, a new lossless data compressor for protein (or amino acid) sequences. AC2 uses a neural network to mix experts through a stacked generalization approach, and it applies individual cache-hash memory models to the highest context orders. Compared to the previous compressor (AC), we show gains of 2-9% and 6-7% in reference-free and reference-based modes, respectively. These gains come at the cost of computations that are about three times slower. AC2 also uses about seven times less memory than AC, and its memory requirements are not affected by the size of the input sequences. As an analysis application, we use AC2 to measure the similarity between each SARS-CoV-2 protein sequence and each viral protein sequence from the whole UniProt database. The results consistently show the highest similarity to the pangolin coronavirus, followed by the bat and human coronaviruses, contributing critical results to a currently controversial subject. AC2 is available for free download under the GPLv3 license.

Keywords: context mixing; coronavirus; lossless data compression; mixture of experts; neural networks; protein sequence compression.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Figure 1
Mixer architecture: high-level overview of the inputs to the neural network (mixer) used in AC2. Model 1 through Model n represent the AC model outputs (probabilities for all the amino acids). EMA represents the exponential moving average for each symbol. Freqs are the symbol frequencies over the last 8, 16, and 64 symbols. The network outputs represent the probabilities for each amino acid symbol.
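As a rough illustration of the inputs listed in the Figure 1 caption, the following Python sketch assembles one mixer input vector from the model outputs, a per-symbol exponential moving average, and symbol frequencies over the last 8, 16, and 64 symbols. The class and method names, the smoothing factor `alpha`, the 20-letter alphabet, and the ordering of the features are assumptions made for illustration; they are not taken from the AC2 source code.

```python
from collections import deque

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues (assumed alphabet)
WINDOWS = (8, 16, 64)                 # window sizes named in the caption


class MixerInputs:
    """Tracks the side information (EMA and windowed frequencies) fed to a mixer."""

    def __init__(self, alphabet=AMINO_ACIDS, alpha=0.1):
        self.alphabet = alphabet
        self.alpha = alpha  # hypothetical smoothing factor
        # Start the EMA at the uniform distribution.
        self.ema = [1.0 / len(alphabet)] * len(alphabet)
        self.history = deque(maxlen=max(WINDOWS))

    def update(self, symbol):
        """Record an observed symbol: update the EMA and the recent history."""
        for i, s in enumerate(self.alphabet):
            hit = 1.0 if s == symbol else 0.0
            self.ema[i] = (1.0 - self.alpha) * self.ema[i] + self.alpha * hit
        self.history.append(symbol)

    def freqs(self, window):
        """Relative frequency of each symbol over the last `window` symbols."""
        recent = list(self.history)[-window:]
        if not recent:
            return [0.0] * len(self.alphabet)
        return [recent.count(s) / len(recent) for s in self.alphabet]

    def vector(self, model_probs):
        """Concatenate model outputs, EMA, and windowed frequencies."""
        features = [p for probs in model_probs for p in probs]
        features.extend(self.ema)
        for w in WINDOWS:
            features.extend(self.freqs(w))
        return features
```

With two models over a 20-symbol alphabet, `vector` returns 2 × 20 model probabilities, 20 EMA values, and 3 × 20 frequencies, for 120 inputs in total. A real mixer would pass this vector through the network and normalize the outputs into a probability distribution over the amino acids.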
Figure 2
Smoothed gain of AC2 relative to AC, in bits per symbol (bps). Regions where the line is above zero indicate that AC2 compresses better than AC. Three profiles are shown, one per species: (a) XV: Xanthomonas virus Xp10, (b) FV: Fowlpox virus, and (c) HS: Homo sapiens.
Figure 3
Smoothed gain of AC2 relative to AC, in bits per symbol (bps). Regions where the line is above zero indicate that AC2 compresses better than AC. Three profiles are shown for the referential compression of three sequence pairs: (a) Chromosome 1 of chimpanzee, (b) Chromosome 17 of gorilla, and (c) Mitochondrion of orangutan. All target sequences use the corresponding human sequence as reference. The compression parameters are the same as in Table 2.
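The gain plotted in Figures 2 and 3 can be read as a per-symbol difference in code length. The sketch below is a minimal illustration, assuming that each compressor exposes the probability it assigned to the symbol that actually occurred, and assuming a simple trailing moving average for the smoothing (the captions do not state which smoothing filter was used). The function names are hypothetical.

```python
import math


def gain_profile(p_ac, p_ac2):
    """Per-symbol gain in bits: code length under AC minus code length under AC2.

    p_ac[i] and p_ac2[i] are the probabilities each model assigned to the
    symbol actually observed at position i (hypothetical inputs).
    """
    return [-math.log2(a) + math.log2(b) for a, b in zip(p_ac, p_ac2)]


def moving_average(values, window):
    """Smooth a profile with a trailing moving average of the given width."""
    smoothed = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed
```

A positive value means that AC2 assigned a higher probability to the observed symbol than AC did, which matches the caption's reading of regions above zero.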
Figure 4
Analysis of the most similar protein sequences from the NCBI database according to multiple protein sequences of SARS-CoV-2. The similarity metric used is the Normalized Compression Distance (NCD). The lower the NCD, the higher the similarity. Five protein sequences are used for comparison: (a) membrane, (b) nucleoprotein, (c) envelope, (e) replicase polyprotein (ORF 1ab), and (f) spike. Panel (d) illustrates the two-dimensional localization of the proteins in SARS-CoV-2, while panel (g) shows the one-dimensional localization of the sequences that correspond to the proteins.
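For readers unfamiliar with the NCD, its standard definition is NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed size and xy is the concatenation of the two sequences. The sketch below computes it in Python. It uses the standard-library `lzma` compressor purely as a stand-in for AC2, so the values it produces are not comparable to those in the figure; the function names are hypothetical.

```python
import lzma


def compressed_size(data: bytes) -> int:
    """Size in bytes after compression; lzma stands in for AC2 here."""
    return len(lzma.compress(data))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Values near 0 indicate that one sequence adds little new information given the other, and values near 1 indicate that the sequences share little. For very short inputs, the fixed overhead of a general-purpose compressor distorts the result, which is one reason a specialized compressor is preferable for this kind of analysis.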
