BMC Bioinformatics. 2008 Sep 23;9:394.

doi: 10.1186/1471-2105-9-394.

Comparison study on k-word statistical measures for protein: from sequence to 'sequence space'

Qi Dai¹, Tianming Wang

PMID: 18811946
PMCID: PMC2571980
DOI: 10.1186/1471-2105-9-394

Qi Dai et al. BMC Bioinformatics. 2008.

Affiliation

¹ Department of Applied Mathematics, Dalian University of Technology, Dalian 116024, PR China. daiailiu2004@yahoo.com.cn

Abstract

Background: Many statistical measures have been proposed that efficiently compare protein sequences in order to infer protein structure, function and evolutionary information. They share the idea of using the k-word frequencies of protein sequences. Until now, however, the information carried by the sequences related to a given protein has not been used for protein sequence comparison. This paper proposes a scheme for constructing a protein 'sequence space' associated with the sequences related to a given protein, and compares the performance of statistical measures with and without the information from this 'sequence space'. It also presents two new statistical measures for proteins: gre.k (generalized relative entropy) and gsm.k (gapped similarity measure).
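To make the notion of k-word frequencies concrete, the following is a minimal sketch in Python. It counts overlapping k-words in a sequence and compares two frequency vectors with a plain cosine similarity. The function names and the example sequences are hypothetical, and the cosine used here is the generic textbook form; it is not claimed to match the exact definition of cos.k, gre.k or gsm.k in the paper.

```python
from collections import Counter
from math import sqrt


def kword_frequencies(seq: str, k: int) -> dict[str, float]:
    """Relative frequencies of the overlapping k-words in seq."""
    total = len(seq) - k + 1  # number of overlapping windows of length k
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(total))
    return {word: n / total for word, n in counts.items()}


def cosine_similarity(f: dict[str, float], g: dict[str, float]) -> float:
    """Cosine of the angle between two sparse k-word frequency vectors."""
    dot = sum(value * g.get(word, 0.0) for word, value in f.items())
    norm_f = sqrt(sum(value * value for value in f.values()))
    norm_g = sqrt(sum(value * value for value in g.values()))
    if norm_f == 0.0 or norm_g == 0.0:
        return 0.0  # convention chosen here for empty vectors (an assumption)
    return dot / (norm_f * norm_g)


# Hypothetical toy sequences, not taken from the paper's data sets.
freq_a = kword_frequencies("MKVLAAGIVK", 2)
freq_b = kword_frequencies("MKVLSAGIVR", 2)
similarity = cosine_similarity(freq_a, freq_b)
```

Storing the frequencies as a dictionary keeps the vectors sparse, which matters because the number of possible k-words over 20 amino acids grows as 20^k.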

Results: We tested the statistical measures, with and without protein 'sequence space', on three data sets. This offers a systematic and quantitative experimental assessment of these measures and complements the available comparisons of statistical measures based on protein sequence alone. We also compared our statistical measures with alignment-based measures and with existing statistical measures. The experiments were grouped into two sets. The first, performed via ROC (Receiver Operating Characteristic) analysis, assesses the intrinsic ability of the statistical measures to discriminate and classify protein sequences. The second assesses how well our measure performs in phylogenetic analysis. From these experiments we draw several conclusions and derive new guidelines for the use of protein 'sequence space' and statistical measures.
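As an illustration of how a ROC analysis summarizes a measure's ability to discriminate related from unrelated sequence pairs, the sketch below computes the area under the ROC curve (AUC) from similarity scores and binary labels. It uses the standard rank interpretation of the AUC: the probability that a randomly chosen positive pair scores higher than a randomly chosen negative pair, with ties counted as one half. The function name and the toy data are hypothetical, and this is not claimed to be the evaluation code used by the authors.

```python
def roc_auc(scores: list[float], labels: list[int]) -> float:
    """AUC for similarity scores; labels are 1 (related) or 0 (unrelated)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive pair ranked above negative pair
            elif p == n:
                wins += 0.5   # ties contribute one half
    return wins / (len(positives) * len(negatives))


# Hypothetical scores for four sequence pairs.
auc = roc_auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])  # 3 of 4 pairs ordered correctly: 0.75
```

For a dissimilarity measure, where smaller values mean more closely related, the scores can be negated before calling the function. A value near 0.5 corresponds to a random classifier, and a value of 1.0 to perfect separation.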

Conclusion: Alignment-based measures have a clear advantage when the data are highly redundant. Among the statistical measures, the most efficient is the novel gsm.k introduced in this article, followed by cos.k. When the data become less redundant, our proposed gre.k performs better, while all the other measures perform poorly on classification tasks. Almost all the statistical measures improve when they explore the information in 'sequence space' as the word length increases, especially for less redundant data. The reasonable results of the phylogenetic analysis confirm that Gdis.k based on 'sequence space' is a reliable measure for phylogenetic analysis. In summary, our quantitative analysis verifies that exploring the information in 'sequence space' is a promising way to improve the ability of statistical measures to compare proteins.

Figures

**Figure 1**
**ROC curves for data CK**. (a) ROC curves for our measures, alignment-based measures and other statistical measures, with all statistical measures based on k-word frequencies of the protein sequence; parameter values are given as suffixes. (b) The same comparison, with all statistical measures based on k-word frequencies of protein 'sequence space'; parameter values are given as suffixes. All abbreviations of (dis)similarity measures are explained in the "List of abbreviations" section. A random classifier would generate equal proportions of FP and TP classifications, which corresponds to the ROC diagonal (dashed line).

**Figure 2**
**ROC curves for data RS**. (a) ROC curves for our measures, alignment-based measures and other statistical measures, with all statistical measures based on k-word frequencies of the protein sequence; parameter values are given as suffixes. (b) The same comparison, with all statistical measures based on k-word frequencies of protein 'sequence space'; parameter values are given as suffixes. All abbreviations of (dis)similarity measures are explained in the "List of abbreviations" section. A random classifier would generate equal proportions of FP and TP classifications, which corresponds to the ROC diagonal (dashed line).

**Figure 3**
**ROC curves for data SP**. (a) ROC curves for our measures, alignment-based measures and other statistical measures, with all statistical measures based on k-word frequencies of the protein sequence; parameter values are given as suffixes. (b) The same comparison, with all statistical measures based on k-word frequencies of protein 'sequence space'; parameter values are given as suffixes. All abbreviations of (dis)similarity measures are explained in the "List of abbreviations" section. A random classifier would generate equal proportions of FP and TP classifications, which corresponds to the ROC diagonal (dashed line).

**Figure 4**
**DAUC values for data CK**. The *DAUC* values of seven statistical measures for data CK. All statistical measures based on k-word frequencies of protein 'sequence space' are run with k from 1 to 4, where protein 'sequence space' is constructed according to ten score matrices. One graph is shown for each word length (from 1 to 4).

**Figure 5**
**DAUC values for data RS**. The *DAUC* values of seven statistical measures for data RS. All statistical measures based on k-word frequencies of protein 'sequence space' are run with k from 1 to 4, where protein 'sequence space' is constructed according to ten score matrices. One graph is shown for each word length (from 1 to 4).

**Figure 6**
**DAUC values for data SP**. The *DAUC* values of seven statistical measures for data SP. All statistical measures based on k-word frequencies of protein 'sequence space' are run with k from 1 to 4, where protein 'sequence space' is constructed according to ten score matrices. One graph is shown for each word length (from 1 to 4).

**Figure 7**
**MAUC values for data sets CK, RS and SP**. The *MAUC* values for the data sets CK, RS and SP, with one graph for each data set. All the statistical measures are based on k-word frequencies of protein 'sequence space', with ten score matrices used to build protein 'sequence space'.

**Figure 8**
**The diagram of phylogenetic relationships**. Phylogenetic relationships are obtained with a neighbor-joining program based on our statistical distance measure *Gdis.k*, using all six SMC subfamilies as well as the related MukB and Rad50. Bootstraps are based on 100 replications, and bootstrap values lower than 50 are hidden.

**Figure 9**
**Representation of a star set**. (a) Diagram of a star set: S is similar to A, T and N in the BLOSUM62 substitution matrix, and S is the midpoint. (b) The star set consists of the midpoint S and the vertices A, T and N.
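The star set in Figure 9 can be illustrated with a short Python sketch. The abstract does not state the rule used to decide which amino acids count as "similar" to the midpoint, so the sketch assumes, purely for illustration, that the vertices are the residues with a positive BLOSUM62 score against the midpoint. Under that assumption the serine row of BLOSUM62 yields exactly the vertices named in the caption. The function name `star_set` and the `threshold` parameter are hypothetical.

```python
# BLOSUM62 scores of serine (S) against the 20 standard amino acids.
BLOSUM62_S = {
    "A": 1, "R": -1, "N": 1, "D": 0, "C": -1, "Q": 0, "E": 0, "G": 0,
    "H": -1, "I": -2, "L": -2, "K": 0, "M": -1, "F": -2, "P": -1,
    "S": 4, "T": 1, "W": -3, "Y": -2, "V": -2,
}


def star_set(midpoint: str, row: dict[str, int], threshold: int = 0):
    """Return (midpoint, vertices), where the vertices are the residues
    scoring above `threshold` against the midpoint.

    The positive-score rule is an assumption made for this sketch; the
    construction actually used in the paper may differ.
    """
    vertices = sorted(
        residue for residue, score in row.items()
        if residue != midpoint and score > threshold
    )
    return midpoint, vertices


midpoint, vertices = star_set("S", BLOSUM62_S)  # vertices: ['A', 'N', 'T']
```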

Cited by

Comparison study on statistical features of predicted secondary structures for protein structural class prediction: From content to position.
Dai Q, Li Y, Liu X, Yao Y, Cao Y, He P. BMC Bioinformatics. 2013 May 4;14:152. doi: 10.1186/1471-2105-14-152. PMID: 23641706.
Alignment-free Transcriptomic and Metatranscriptomic Comparison Using Sequencing Signatures with Variable Length Markov Chains.
Liao W, Ren J, Wang K, Wang S, Zeng F, Wang Y, Sun F. Sci Rep. 2016 Nov 23;6:37243. doi: 10.1038/srep37243. PMID: 27876823.
A Markovian analysis of bacterial genome sequence constraints.
Skewes AD, Welch RD. PeerJ. 2013 Aug 29;1:e127. doi: 10.7717/peerj.127. eCollection 2013. PMID: 24010012.
Assembly-free genome comparison based on next-generation sequencing reads and variable length patterns.
Comin M, Schimd M. BMC Bioinformatics. 2014;15(Suppl 9):S1. doi: 10.1186/1471-2105-15-S9-S1. Epub 2014 Sep 10. PMID: 25252700.
Comparison of metatranscriptomic samples based on k-tuple frequencies.
Wang Y, Liu L, Chen L, Chen T, Sun F. PLoS One. 2014 Jan 2;9(1):e84348. doi: 10.1371/journal.pone.0084348. eCollection 2014. PMID: 24392128.
