Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2007 Aug 23:8:307.
doi: 10.1186/1471-2105-8-307.

Protein structural similarity search by Ramachandran codes

Affiliations

Protein structural similarity search by Ramachandran codes

Wei-Cheng Lo et al. BMC Bioinformatics. .

Abstract

Background: Protein structural data has increased exponentially, such that fast and accurate tools are necessary to access structure similarity search. To improve the search speed, several methods have been designed to reduce three-dimensional protein structures to one-dimensional text strings that are then analyzed by traditional sequence alignment methods; however, the accuracy is usually sacrificed and the speed is still unable to match sequence similarity search tools. Here, we aimed to improve the linear encoding methodology and develop efficient search tools that can rapidly retrieve structural homologs from large protein databases.

Results: We propose a new linear encoding method, SARST (Structural similarity search Aided by Ramachandran Sequential Transformation). SARST transforms protein structures into text strings through a Ramachandran map organized by nearest-neighbor clustering and uses a regenerative approach to produce substitution matrices. Then, classical sequence similarity search methods can be applied to the structural similarity search. Its accuracy is similar to Combinatorial Extension (CE) and works over 243,000 times faster, searching 34,000 proteins in 0.34 sec with a 3.2-GHz CPU. SARST provides statistically meaningful expectation values to assess the retrieved information. It has been implemented into a web service and a stand-alone Java program that is able to run on many different platforms.

Conclusion: As a database search method, SARST can rapidly distinguish high from low similarities and efficiently retrieve homologous structures. It demonstrates that the easily accessible linear encoding methodology has the potential to serve as a foundation for efficient protein structural similarity search tools. These search tools are supposed applicable to automated and high-throughput functional annotations or predictions for the ever increasing number of published protein structures in this post-genomic era.

PubMed Disclaimer

Figures

Figure 1
Figure 1
Flowchart of SARST approaches. Three-dimensional (3D) protein structures are first transformed onto two-dimensional (2D) Ramachandran maps and then further converted into one-dimensional (1D) text strings. Thus, a structural similarity search could be performed by classical sequence similarity search methods. The Ramachandran plot shown here was generated by PROCHECK [54]. (Note that "similarity search" is more typically termed as "alignment search"; however, considering that SARST is designed as a search method rather than an alignment tool, we will use the former term throughout this report to avoid misunderstanding.)
Figure 2
Figure 2
The Ramachandran map of SARST. The 1,296 cells on the map were clustered into 22 groups by the Ramachandran sequential transformation algorithm. See text for details.
Figure 3
Figure 3
Average precision-recall curves of several search methods. FAST was the most accurate search method. SARST ranked third and achieved precisions ~4% lower than CE, which was the second most accurate method in this experiment. Linear encoding methods TOPSCAN [17], YAKUSA [21] and 3D-BLAST [22] describe protein structures as strings. ProtDex2 transforms protein structures into indexes [30]. These curves of ProtDex2 and TOPSCAN were adapted from Aung and Tan's report [30]. The precision percentage is plotted on the y-axis and the recall percentage is plotted on the x-axis.
Figure 4
Figure 4
Performances among different structural classes and proteins with incomplete structures. The average fallouts after the retrieval of 80 proteins were calculated. Because fallout is a measure of the false positive rate, these data demonstrate that the performances of SARST are fairly even (i.e. no obvious bias) among the four major classes as compared to those of BLAST. The fallouts of SARST are generally lower than YAKUSA [21] and 3D-BLAST [22], both of which are protein structural similarity search tools with linear encoding methodologies. When tested with query proteins having incomplete local backbone structures, SARST outperforms other linear encoding methods and CE. The query proteins used in this experiment was set by Aung and Tan [30]; the subset of incomplete structures and the extent of incompleteness are listed in the supplementary materials [see Additional file 5].
Figure 5
Figure 5
Effects of sequence identities on the precision of several search methods. The structure similarity search method, SARST, was able to detect remote homology with increased precisions compared with other linear encoding algorithms and the conventional amino acid sequence search method, BLAST. These data also show that there is still room left for the improvement of linear encoding methodology. Possible solutions are proposed in Discussion. The average precisions used in this figure were calculated at the representative 60% recall level.
Figure 6
Figure 6
Examples of distantly related proteins retrieved by SARST. (a) The superimposed structures (cross-eye stereo view) of the interleukin 8-like chemokines from human, [SCOP:d1b3aa_] (blue) and [SCOP:d1tvxa_] (red). These two proteins have a low sequence identity while their structures are highly similar. The minimum RMSD calculated from positive positions of the Ramachandran sequence alignment is 1.68 Å. (b) Three-dimensional structures and the superimposition of [SCOP:d1tpo__] (blue) (SCOP sccs id: b.47.1.2), the trypsin from Bos taurus, and [SCOP:d1p3ca_] (red) (SCOP sccs id: b.47.1.1), a Bacillus intermedius glutamyl endopeptidase. These two proteases belong to different families but have similar structures. Although the amino acid sequence alignment fails to detect their functional similarities, the catalytic triad residues (highlighted in green) are well aligned by SARST. Their minimum RMSD is 4.17 Å, whereas their amino acid sequence identity is 22%. The secondary structural cartoons were generated by PROCHECK [54] and then modified with colors and gaps.

References

    1. Chothia C, Lesk AM. The relation between the divergence of sequence and structure in proteins. Embo J. 1986;5:823–826. - PMC - PubMed
    1. Murzin AG. How far divergent evolution goes in proteins. Curr Opin Struct Biol. 1998;8:380–387. doi: 10.1016/S0959-440X(98)80073-0. - DOI - PubMed
    1. Sauder JM, Arthur JW, Dunbrack RL., Jr. Large-scale comparison of protein sequence alignment algorithms with structure alignments. Proteins. 2000;40:6–22. doi: 10.1002/(SICI)1097-0134(20000701)40:1<6::AID-PROT30>3.0.CO;2-7. - DOI - PubMed
    1. Holm L, Sander C. Protein structure comparison by alignment of distance matrices. J Mol Biol. 1993;233:123–138. doi: 10.1006/jmbi.1993.1489. - DOI - PubMed
    1. Shindyalov IN, Bourne PE. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 1998;11:739–747. doi: 10.1093/protein/11.9.739. - DOI - PubMed

Publication types

MeSH terms