BMC Bioinformatics. 2014;15 Suppl 2(Suppl 2):S3.

doi: 10.1186/1471-2105-15-S2-S3. Epub 2014 Jan 24.

Using distances between Top-n-gram and residue pairs for protein remote homology detection

Bin Liu, Jinghao Xu, Quan Zou, Ruifeng Xu, Xiaolong Wang, Qingcai Chen

PMID: 24564580
PMCID: PMC4015815
DOI: 10.1186/1471-2105-15-S2-S3

Bin Liu et al. BMC Bioinformatics. 2014.

Abstract

Background: Protein remote homology detection is a central problem in bioinformatics, important for both basic research and practical applications. Discriminative methods based on Support Vector Machines (SVMs) currently achieve state-of-the-art performance. A key step in improving SVM-based methods is to design feature vectors that incorporate the position information of amino acids or other protein building blocks.

Results: Two new methods for protein remote homology detection are proposed, SVM-DR and SVM-DT. SVM-DR is a sequence-based method whose feature vector representation of a protein is based on the distances between residue pairs. SVM-DT is a profile-based method that considers the distances between Top-n-gram pairs. A Top-n-gram can be viewed as a profile-based building block of proteins and is calculated from the frequency profiles. Both methods are position-dependent approaches that incorporate the sequence-order information of protein sequences. Experiments were conducted on a benchmark dataset containing 54 families and 23 superfamilies. The results show that both methods are promising and clearly outperform the position-independent methods. Furthermore, the proposed methods can provide useful insights for studying the features of protein families.

Conclusion: The better performance of the proposed methods demonstrates that position-dependent approaches are efficient for protein remote homology detection. Another advantage of our methods is their explicit feature space representation, which can be used to analyze the characteristic features of protein families. The source code of SVM-DT and SVM-DR is available at http://bioinformatics.hitsz.edu.cn/DistanceSVM/index.jsp.

Figures

**Figure 1**
**The process of generating the Distance-based Top-1-gram feature vector**. A protein S is submitted to PSI-BLAST to generate multiple sequence alignments against a non-redundant database, and the frequency profile is calculated from these alignments. The frequencies of the 20 standard amino acids in each column of the frequency profile are sorted in descending order. The Top-1-gram is the most frequent amino acid in each column of the frequency profile. S can then be represented as a sequence of Top-1-grams S' by combining all the obtained Top-1-grams according to their sequence order. Assuming that the distance threshold *d_MAX* is set to 2, the feature vector is the combination of Top-1-gram pairs at distances 0, 1, and 2.
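The column-wise selection described in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `top_1_gram_sequence`, the layout of `profile` (one list of 20 frequencies per sequence position), the alphabet order, and the tie-breaking rule are all assumptions made for the example. Running PSI-BLAST and computing the frequency profile itself are outside the sketch.

```python
from typing import List, Sequence

# Assumed amino-acid order (first symbol A, last symbol V); the caption
# does not fix the order, so treat this as illustrative only.
ALPHABET = "ARNDCQEGHILKMFPSTWYV"


def top_1_gram_sequence(profile: Sequence[Sequence[float]]) -> str:
    """Return the Top-1-gram sequence S' for a frequency profile.

    profile[i][j] is assumed to hold the frequency of ALPHABET[j]
    at sequence position (profile column) i.
    """
    symbols: List[str] = []
    for column in profile:
        if len(column) != len(ALPHABET):
            raise ValueError("each profile column needs 20 frequencies")
        # Pick the most frequent amino acid in this column. On ties,
        # max() keeps the earliest index (an assumption of this sketch).
        best = max(range(len(ALPHABET)), key=lambda j: column[j])
        symbols.append(ALPHABET[best])
    return "".join(symbols)
```

Because only the per-column maximum is needed for Top-1-grams, the sketch skips the full descending sort mentioned in the caption; a full sort would matter only for Top-n-grams with n greater than 1.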

**Figure 2**
**Algorithm for constructing the Distance-based Top-1-gram feature vector**. The inputs of this algorithm are the Top-1-gram sequence S' and the distance threshold *d_MAX*, and the output is the feature vector of distance-based Top-1-grams. The array *Index[]* maps each Top-1-gram to its position in the alphabet Ӑ, whose size is 20; for example, index 0 indicates the first Top-1-gram in Ӑ (*t₁* = A), and index 19 indicates the last Top-1-gram in Ӑ (*t₁₉* = V).
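A hedged sketch of such an algorithm is given below. The caption does not specify the vector layout, so several points are assumptions of this example rather than the published method: distance 0 is taken to mean single-symbol occurrences (20 entries), each distance d from 1 to *d_MAX* contributes 400 ordered-pair counts, raw counts are used without normalization, and the alphabet order is chosen only to agree with A being first and V being last. The names `INDEX` and `distance_pair_features` are hypothetical.

```python
from typing import Dict, List

# Assumed alphabet order; only "A first, V last" comes from the caption.
ALPHABET = "ARNDCQEGHILKMFPSTWYV"
INDEX: Dict[str, int] = {symbol: i for i, symbol in enumerate(ALPHABET)}


def distance_pair_features(seq: str, d_max: int) -> List[int]:
    """Count symbol pairs of `seq` at distances 0..d_max.

    Assumed layout: [20 single-symbol counts] followed by one block of
    20 * 20 ordered-pair counts for each distance d = 1..d_max.
    """
    n = len(ALPHABET)
    features = [0] * (n + n * n * d_max)

    # Distance 0 (assumption): occurrences of each single symbol.
    for symbol in seq:
        features[INDEX[symbol]] += 1

    # Distance d >= 1: ordered pairs (seq[i], seq[i + d]).
    for d in range(1, d_max + 1):
        offset = n + (d - 1) * n * n
        for i in range(len(seq) - d):
            first, second = INDEX[seq[i]], INDEX[seq[i + d]]
            features[offset + first * n + second] += 1
    return features
```

For example, `distance_pair_features("AAV", 2)` yields a vector of length 820 under this layout. The same routine would apply to a raw residue sequence as well as to a Top-1-gram sequence, since both are strings over the 20-letter alphabet.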

**Figure 3**
The average ROC scores of the SVM-DR and SVM-DT with different distance threshold values of *d_MAX*.

**Figure 4**
**The discriminative power (L₂-norm) of discriminant vectors for all possible combinations of Top-1-gram pair (A) and residue pair (B) of protein family 2.5.1.3**. The amino acids are identified by their one-letter code. The amino acids labeled by x-axis and y-axis in figure (A) indicate the first Top-1-gram and the second Top-1-gram in Top-1-gram pairs of SVM-DT, respectively; the amino acids labeled by x-axis and y-axis in figure (B) indicate the first residue and the second residue in residue pairs of SVM-DR, respectively. The adjacent color bar shows the mapping of L₂-norm values.
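One plausible reading of this caption is that, for each ordered pair, the discriminant weights at every distance are collected into a vector whose L₂-norm is then plotted. The sketch below illustrates that reading only; it is not taken from the paper. It assumes a linear discriminant weight vector laid out as `n` single-symbol weights followed by one block of `n * n` pair weights per distance, and the function name `pair_l2_norms` is hypothetical.

```python
import math
from typing import List, Sequence


def pair_l2_norms(weights: Sequence[float], d_max: int, n: int = 20) -> List[List[float]]:
    """Return an n x n matrix of L2-norms, one per ordered symbol pair.

    Assumed layout of `weights`: n single-symbol weights, then for each
    distance d = 1..d_max a block of n * n pair weights, row-major by
    (first symbol, second symbol).
    """
    expected = n + n * n * d_max
    if len(weights) != expected:
        raise ValueError(f"expected {expected} weights, got {len(weights)}")

    norms = [[0.0] * n for _ in range(n)]
    for first in range(n):
        for second in range(n):
            # Gather this pair's weight at every distance and take the norm.
            squared = sum(
                weights[n + (d - 1) * n * n + first * n + second] ** 2
                for d in range(1, d_max + 1)
            )
            norms[first][second] = math.sqrt(squared)
    return norms
```

The resulting matrix could be rendered as a heat map with the first symbol on one axis and the second on the other, which matches the axes described in the caption.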

**Figure 5**
The discriminant weights of the most discriminative Top-1-gram pairs (G, G) and (L, L) of SVM-DT for family 2.5.1.3 are shown in figure (A) and (B), respectively; the discriminant weights of the most discriminative residue pairs (G, G) and (L, L) of SVM-DR for family 2.5.1.3 are shown in figure (C) and (D), respectively.

Cited by

CarSite-II: an integrated classification algorithm for identifying carbonylated sites based on K-means similarity-based undersampling and synthetic minority oversampling techniques.
Zuo Y, Lin J, Zeng X, Zou Q, Liu X. BMC Bioinformatics. 2021 Apr 26;22(1):216. doi: 10.1186/s12859-021-04134-3. PMID: 33902446. Free PMC article.
Protein binding site prediction by combining hidden Markov support vector machine and profile-based propensities.
Liu B, Liu B, Liu F, Wang X. ScientificWorldJournal. 2014;2014:464093. doi: 10.1155/2014/464093. Epub 2014 Jul 14. PMID: 25133234. Free PMC article.
A novel two-way rebalancing strategy for identifying carbonylation sites.
Chen L, Jing XY, Hao Y, Liu W, Zhu X, Han W. BMC Bioinformatics. 2023 Nov 13;24(1):429. doi: 10.1186/s12859-023-05551-2. PMID: 37957582. Free PMC article.
dRHP-PseRA: detecting remote homology proteins using profile-based pseudo protein sequence and rank aggregation.
Chen J, Long R, Wang XL, Liu B, Chou KC. Sci Rep. 2016 Sep 1;6:32333. doi: 10.1038/srep32333. PMID: 27581095. Free PMC article.
Design of Protein Segments and Peptides for Binding to Protein Targets.
Gupta S, Azadvari N, Hosseinzadeh P. Biodes Res. 2022 Apr 15;2022:9783197. doi: 10.34133/2022/9783197. eCollection 2022. PMID: 37850124. Free PMC article. Review.
