Comparative Study

BMC Bioinformatics. 2007 Jul 13;8:252.

doi: 10.1186/1471-2105-8-252.

Compression-based classification of biological sequences and structures via the Universal Similarity Metric: experimental assessment

Paolo Ferragina¹, Raffaele Giancarlo, Valentina Greco, Giovanni Manzini, Gabriel Valiente

PMID: 17629909
PMCID: PMC1939857
DOI: 10.1186/1471-2105-8-252

Paolo Ferragina et al. BMC Bioinformatics. 2007.

Affiliation

¹ Dipartimento di Matematica Applicazioni, Università di Palermo, Italy. ferragin@di.unipi.it

Abstract

Background: Similarity of sequences is a key mathematical notion for Classification and Phylogenetic studies in Biology. It is currently handled primarily using alignments. However, alignment methods seem inadequate for post-genomic studies since they do not scale well with data set size and seem to be confined to genomic and proteomic sequences. Therefore, alignment-free similarity measures are actively pursued. Among those, USM (Universal Similarity Metric) has gained prominence. It is based on the deep theory of Kolmogorov Complexity, and universality is its most novel and striking feature. Since it can only be approximated via data compression, USM is a methodology rather than a formula quantifying the similarity of two strings. Three approximations of USM are available, namely UCD (Universal Compression Dissimilarity), NCD (Normalized Compression Dissimilarity) and CD (Compression Dissimilarity). Their applicability and robustness are tested on various data sets, yielding a first massive quantitative estimate that the USM methodology and its approximations are of value. Despite the rich theory developed around USM, its experimental assessment has limitations: only a few data compressors have been tested in conjunction with USM, and mostly at a qualitative level; no comparison among UCD, NCD and CD is available; and no comparison of USM with existing methods, whether based on alignments or not, seems to be available.
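The abstract names the three approximations but does not give their formulas. As a minimal sketch, the Python below uses the definitions commonly given in the compression-distance literature, with C(x) standing for the compressed length of x. Both the formulas and the choice of `zlib` as the compressor are assumptions made for illustration; `zlib` is not stated here to be one of the 25 compressors tested, and the function names are hypothetical.

```python
import zlib


def c(data: bytes) -> int:
    """Compressed length of data, using zlib as a stand-in compressor (assumption)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Assumed NCD: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = c(x), c(y)
    return (c(x + y) - min(cx, cy)) / max(cx, cy)


def cd(x: bytes, y: bytes) -> float:
    """Assumed CD: C(xy) / (C(x) + C(y))."""
    return c(x + y) / (c(x) + c(y))


def ucd(x: bytes, y: bytes) -> float:
    """Assumed UCD: max(C(xy) - C(x), C(yx) - C(y)) / max(C(x), C(y)).

    C(xy) - C(x) approximates the conditional compressed length of y given x.
    """
    cx, cy = c(x), c(y)
    return max(c(x + y) - cx, c(y + x) - cy) / max(cx, cy)
```

With any of the three, smaller values indicate more similar inputs. Because real compressors are imperfect, the values are approximate and can fall slightly outside the ideal range.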

Results: We experimentally test the USM methodology by using 25 compressors, all three of its known approximations and six data sets of relevance to Molecular Biology. This offers the first systematic and quantitative experimental assessment of this methodology, which naturally complements the many theoretical and the preliminary experimental results available. Moreover, we compare the USM methodology with both alignment-based and alignment-free methods. We may group our experiments into two sets. The first one, performed via ROC (Receiver Operating Characteristic) analysis, aims at assessing the intrinsic ability of the methodology to discriminate and classify biological sequences and structures. The second set of experiments aims at assessing how well two commonly available classification algorithms, UPGMA (Unweighted Pair Group Method with Arithmetic Mean) and NJ (Neighbor Joining), can use the methodology to perform their task, their performance being evaluated against gold standards and with the use of well-known statistical indexes, i.e., the F-measure and the partition distance. Based on the experiments, several conclusions can be drawn and, from them, novel and valuable guidelines derived for the use of USM on biological data. The main ones are reported next.
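The abstract does not spell out the ROC protocol. One common way to score a dissimilarity measure against a gold-standard classification is to treat every pair of items as a test case, positive when the two items share a class label, and to compute the AUC as the probability that a same-class pair has a smaller dissimilarity than a different-class pair. The sketch below follows that assumption; `pairwise_auc` and its parameters are hypothetical names.

```python
from itertools import combinations


def pairwise_auc(items, labels, dissimilarity):
    """AUC of a dissimilarity function used as a same-class detector.

    Equals the fraction of (same-class pair, different-class pair)
    combinations in which the same-class pair is closer; ties count 0.5.
    """
    same, diff = [], []
    for i, j in combinations(range(len(items)), 2):
        score = dissimilarity(items[i], items[j])
        (same if labels[i] == labels[j] else diff).append(score)
    if not same or not diff:
        raise ValueError("need at least one same-class and one different-class pair")
    wins = sum((s < t) + 0.5 * (s == t) for s in same for t in diff)
    return wins / (len(same) * len(diff))
```

An AUC of 1.0 means the measure separates the classes perfectly, and 0.5 corresponds to random guessing. The double loop is quadratic in the number of pairs, which is acceptable for data sets of a few dozen items.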

Conclusion: UCD and NCD are indistinguishable, i.e., they yield nearly the same values of the statistical indexes we have used, across experiments and data sets, while CD is almost always worse than both. UPGMA seems to yield better classification results than NJ, i.e., better values of the statistical indexes (10% difference or above), on a substantial fraction of experiments, compressors and USM approximation choices. The compression program PPMd, based on PPM (Prediction by Partial Matching), for generic data, and Gencompress, for DNA, are the best performers among the compression algorithms we have used, although the difference in performance, as measured by statistical indexes, between them and the other algorithms depends critically on the data set and may not be as large as expected. PPMd used with UCD or NCD and UPGMA on sequence data is very close in performance to the alignment methods, although worse (less than 2% difference on the F-measure). Yet, it scales well with data set size and it can work on data other than sequences. In summary, our quantitative analysis naturally complements the rich theory behind USM and supports the conclusion that the methodology is worth using because of its robustness, flexibility, scalability, and competitiveness with existing techniques. In particular, the methodology applies to all biological data in textual format. The software and data sets are available under the GNU GPL at the supplementary material web page.
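To make the UPGMA step concrete, the following is a minimal textbook sketch in Python, not the software used in the study. It repeatedly merges the two closest clusters and updates distances as a size-weighted average. The function name and the input layout (a dictionary keyed by `frozenset` pairs of labels) are assumptions chosen for illustration.

```python
from itertools import combinations


def upgma(labels, dist):
    """Build a UPGMA tree as nested tuples.

    labels: distinct hashable leaf names.
    dist:   dict mapping frozenset({a, b}) to the dissimilarity of a and b.
    """
    sizes = {label: 1 for label in labels}  # active cluster -> number of leaves
    d = dict(dist)                          # working copy, extended as clusters merge
    while len(sizes) > 1:
        # Pick the pair of active clusters with the smallest average distance.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        for other in sizes:
            # Size-weighted mean keeps d equal to the average over all leaf pairs.
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))] + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes))
```

The input dictionary could be filled with any pairwise dissimilarity, for example a compression-based one. Branch lengths are omitted to keep the sketch short, and ties are broken by insertion order.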

Figures

**Figure 1**
**Alternative representations**. PDB protein domain 1hlm00, a globin from the sea cucumber *Caudina arenicola*, the only protein common to the Chew-Kedem data set (CK-36-PDB and SP-86-PDB: amino acid sequence in FASTA format; CK-36-REL: complete TOPS string, with contact map; CK-36-SEQ: TOPS string of secondary structure elements) and Sierk-Pearson data set (SP-86-ATOM: ATOM lines from the PDB entry).

**Figure 2**
**ROC curves for CK-36-PDB**. ROC curves for the CK-36-PDB data set, one for each classification task (class, architecture, topology) and each measure (UCD, NCD, CD). Only the three algorithms with highest (green) and lowest (red) AUC values are shown.

**Figure 3**
**ROC curves for CK-36-REL**. ROC curves for the CK-36-REL data set, one for each classification task (class, architecture, topology) and each measure (UCD, NCD, CD). Only the three algorithms with highest (green) and lowest (red) AUC values are shown.

**Figure 4**
**ROC curves for CK-36-SEQ**. ROC curves for the CK-36-SEQ data set, one for each classification task (class, architecture, topology) and each measure (UCD, NCD, CD). Only the three algorithms with highest (green) and lowest (red) AUC values are shown.

**Figure 5**
**ROC curves for SP-86-PDB**. ROC curves for the SP-86-PDB data set, one for each classification task (class, architecture, topology) and each measure (UCD, NCD, CD). Only the three algorithms with highest (green) and lowest (red) AUC values are shown.

**Figure 6**
**ROC curves for SP-86-ATOM**. ROC curves for the SP-86-ATOM data set, one for each classification task (class, architecture, topology) and each measure (UCD, NCD, CD). Only the three algorithms with highest (green) and lowest (red) AUC values are shown.

**Figure 7**
**ROC curves for alignment and k-mer frequencies**. ROC curves for global and local alignment and k-mer frequencies, one for each data set (CK-36-PDB and SP-86-PDB) and each classification task (class, architecture, topology).

Cited by

Using Recursive Feature Selection with Random Forest to Improve Protein Structural Class Prediction for Low-Similarity Sequences.
Wang Y, Xu Y, Yang Z, Liu X, Dai Q. Comput Math Methods Med. 2021 May 7;2021:5529389. doi: 10.1155/2021/5529389. eCollection 2021. PMID: 34055035 Free PMC article.
Data compression for sequencing data.
Deorowicz S, Grabowski S. Algorithms Mol Biol. 2013 Nov 18;8(1):25. doi: 10.1186/1748-7188-8-25. PMID: 24252160 Free PMC article.
Information Theory Opens New Dimensions in Experimental Studies of Animal Behaviour and Communication.
Reznikova Z. Animals (Basel). 2023 Mar 26;13(7):1174. doi: 10.3390/ani13071174. PMID: 37048430 Free PMC article. Review.
Comparison of Compression-Based Measures with Application to the Evolution of Primate Genomes.
Pratas D, Silva RM, Pinho AJ. Entropy (Basel). 2018 May 23;20(6):393. doi: 10.3390/e20060393. PMID: 33265483 Free PMC article.
Comparison study on statistical features of predicted secondary structures for protein structural class prediction: From content to position.
Dai Q, Li Y, Liu X, Yao Y, Cao Y, He P. BMC Bioinformatics. 2013 May 4;14:152. doi: 10.1186/1471-2105-14-152. PMID: 23641706 Free PMC article.

References

1. Kolmogorov Library Supplementary Material Web Page. http://www.math.unipa.it/~raffaele/kolmogorov/
2. Kruskal J, Sankoff D, (Eds). Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley; 1983.
3. Waterman M. Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman Hall; 1995.
4. Gusfield D. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press; 1997.
5. Vinga S, Almeida J. Alignment-Free Sequence Comparison: A Review. Bioinformatics. 2003;19:513–523.
