BMC Bioinformatics. 2014;15 Suppl 2(Suppl 2):S3.
doi: 10.1186/1471-2105-15-S2-S3. Epub 2014 Jan 24.

Using distances between Top-n-gram and residue pairs for protein remote homology detection

Bin Liu et al. BMC Bioinformatics. 2014.

Abstract

Background: Protein remote homology detection is one of the central problems in bioinformatics, which is important for both basic research and practical application. Currently, discriminative methods based on Support Vector Machines (SVMs) achieve the state-of-the-art performance. Exploring feature vectors incorporating the position information of amino acids or other protein building blocks is a key step to improve the performance of the SVM-based methods.

Results: Two new methods for protein remote homology detection are proposed, called SVM-DR and SVM-DT. SVM-DR is a sequence-based method in which the feature vector representation of a protein is based on the distances between residue pairs. SVM-DT is a profile-based method that considers the distances between Top-n-gram pairs. A Top-n-gram can be viewed as a profile-based building block of proteins, calculated from the frequency profiles. Both methods are position-dependent approaches that incorporate the sequence-order information of protein sequences. Experiments on a benchmark dataset containing 54 families and 23 superfamilies showed that both methods are very promising and clearly improve on position-independent methods. Furthermore, the proposed methods can provide useful insights for studying the features of protein families.
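The residue-pair idea behind SVM-DR can be illustrated with a small sketch. The function name `distance_pair_counts`, the raw (unnormalized) counts, and the reading of "distance" as the difference between sequence positions (with distance 0 pairing a residue with itself) are assumptions made for illustration, not details taken from the abstract.

```python
from collections import Counter


def distance_pair_counts(sequence, d_max):
    """Count ordered pairs (first, second, distance) for 0 <= distance <= d_max.

    Assumption: distance is the difference between sequence positions, and
    distance 0 pairs each residue with itself (plain composition).
    """
    counts = Counter()
    for i, first in enumerate(sequence):
        for d in range(d_max + 1):
            j = i + d
            if j >= len(sequence):
                break  # no partner residue left at this distance
            counts[(first, sequence[j], d)] += 1
    return counts


# Toy sequence, chosen only for illustration.
counts = distance_pair_counts("GAGL", d_max=2)
print(counts[("G", "G", 0)], counts[("G", "G", 2)])
```

Because the pair order and the distance are both part of the key, two sequences with the same composition but a different residue order generally produce different counts, which is what makes the representation position dependent.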

Conclusion: The better performance of the proposed methods demonstrates that position-dependent approaches are efficient for protein remote homology detection. Another advantage of our methods arises from the explicit feature space representation, which can be used to analyze the characteristic features of protein families. The source code of SVM-DT and SVM-DR is available at http://bioinformatics.hitsz.edu.cn/DistanceSVM/index.jsp.

Figures

Figure 1
The process of generating the distance-based Top-1-gram feature vector. A protein S is input into PSI-BLAST, which builds a multiple sequence alignment against a non-redundant database; the frequency profile is then calculated from this alignment. The frequencies of the 20 standard amino acids in each column of the frequency profile are sorted in descending order. The Top-1-gram is the most frequent amino acid in each column of the frequency profile. S can be represented as a sequence of Top-1-grams, S', by combining all the obtained Top-1-grams according to their sequence order. Assuming that the distance threshold dMAX is set to 2, the feature vector is the combination of Top-1-gram pairs at distances 0, 1, and 2.
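The step from a frequency profile to the Top-1-gram sequence S' can be sketched as follows. This is a minimal illustration that assumes the profile is already available as a list of columns with 20 frequencies each; the amino acid order in `AMINO_ACIDS`, the helper `column`, and the tie-breaking rule are assumptions, and no PSI-BLAST call is made.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # assumed ordering of the 20 profile rows


def top_1_gram_sequence(profile):
    """Return S': the most frequent amino acid of every profile column.

    Ties are broken in favor of the amino acid that comes first in
    AMINO_ACIDS (an assumption; the caption does not discuss ties).
    """
    result = []
    for col in profile:
        if len(col) != len(AMINO_ACIDS):
            raise ValueError("each column needs exactly 20 frequencies")
        best = max(range(len(col)), key=col.__getitem__)
        result.append(AMINO_ACIDS[best])
    return "".join(result)


def column(**frequencies):
    """Hypothetical helper: build one column from keywords such as G=0.7."""
    return [frequencies.get(aa, 0.0) for aa in AMINO_ACIDS]


# Toy three-column profile, invented for illustration.
profile = [column(G=0.7, A=0.3), column(L=0.6, V=0.4), column(A=0.2, G=0.8)]
top1 = top_1_gram_sequence(profile)
print(top1)
```

Taking the maximum of each column gives the same result as sorting the column in descending order and keeping the first entry, so the full sort is not needed when only the Top-1-gram is used.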
Figure 2
Algorithm for constructing the distance-based Top-1-gram feature vector. The inputs of this algorithm are the Top-1-gram sequence S' and the distance threshold dMAX; the output is the feature vector of distance-based Top-1-grams. The vector Index[] holds the index of every Top-1-gram in the alphabet Ӑ, and 20 is the size of Ӑ. For example, index 0 indicates the first Top-1-gram in the alphabet Ӑ (t1 = A), and index 19 indicates the last Top-1-gram in the alphabet Ӑ (t20 = V).
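An index-based construction in the spirit of this caption might look like the sketch below. The vector layout (20 slots for distance 0 followed by one 20 x 20 block per distance from 1 to dMAX), the use of raw counts, and the alphabet order between A and V are assumptions chosen for illustration; the figure itself is not reproduced here and may differ.

```python
ALPHABET = "ARNDCQEGHILKMFPSTWYV"  # assumed order: index 0 is A, index 19 is V
INDEX = {t: i for i, t in enumerate(ALPHABET)}
SIZE = len(ALPHABET)


def pair_slot(first, second, d):
    """Position of the pair (first, second) at distance d >= 1 in the vector."""
    return SIZE + (d - 1) * SIZE * SIZE + INDEX[first] * SIZE + INDEX[second]


def distance_feature_vector(top1_sequence, d_max):
    """Build a dense count vector from a Top-1-gram sequence S'.

    Assumed layout: SIZE single-element counts (distance 0), then one
    SIZE * SIZE block of ordered pair counts for each distance 1..d_max.
    """
    vector = [0] * (SIZE + SIZE * SIZE * d_max)
    n = len(top1_sequence)
    for i, first in enumerate(top1_sequence):
        vector[INDEX[first]] += 1  # distance 0: occurrence of the element itself
        for d in range(1, d_max + 1):
            if i + d >= n:
                break
            vector[pair_slot(first, top1_sequence[i + d], d)] += 1
    return vector


vector = distance_feature_vector("GAGL", d_max=2)
print(len(vector), vector[INDEX["G"]], vector[pair_slot("G", "G", 2)])
```

A dense vector with a fixed layout keeps every feature at the same position for every protein, which is what allows individual SVM weights to be mapped back to specific pairs and distances.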
Figure 3
The average ROC scores of the SVM-DR and SVM-DT with different distance threshold values of dMAX.
Figure 4
The discriminative power (L2-norm) of discriminant vectors for all possible combinations of Top-1-gram pairs (A) and residue pairs (B) of protein family 2.5.1.3. The amino acids are identified by their one-letter codes. The amino acids on the x-axis and y-axis in panel (A) indicate the first and the second Top-1-gram in the Top-1-gram pairs of SVM-DT, respectively; the amino acids on the x-axis and y-axis in panel (B) indicate the first and the second residue in the residue pairs of SVM-DR, respectively. The adjacent color bar shows the mapping of L2-norm values.
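One plausible way to obtain a per-pair L2-norm like the one plotted here is sketched below. It assumes that each pair has one linear SVM weight per distance and that the norm is taken over those distances; this reading of the caption, the function name, and the toy weights are all assumptions rather than details confirmed by the source.

```python
import math


def pair_discriminative_power(weights):
    """Map each (first, second) pair to the L2-norm of its weights.

    weights: dict keyed by (first, second, distance) holding one linear
    SVM weight per feature (hypothetical input format).
    """
    squared = {}
    for (first, second, _distance), w in weights.items():
        squared[(first, second)] = squared.get((first, second), 0.0) + w * w
    return {pair: math.sqrt(total) for pair, total in squared.items()}


# Invented weights, for illustration only.
toy_weights = {
    ("G", "G", 1): 0.3,
    ("G", "G", 2): 0.4,
    ("L", "L", 1): -0.5,
}
power = pair_discriminative_power(toy_weights)
print(power)
```

Squaring the weights before summing means that strongly negative weights contribute as much as strongly positive ones, so the norm measures how much a pair matters to the classifier rather than the direction of its effect.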
Figure 5
The discriminant weights of the most discriminative Top-1-gram pairs (G, G) and (L, L) of SVM-DT for family 2.5.1.3 are shown in figure (A) and (B), respectively; the discriminant weights of the most discriminative residue pairs (G, G) and (L, L) of SVM-DR for family 2.5.1.3 are shown in figure (C) and (D), respectively.
