2009 Jan 30;10 Suppl 1(Suppl 1):S46.
doi: 10.1186/1471-2105-10-S1-S46.

Towards comprehensive structural motif mining for better fold annotation in the "twilight zone" of sequence dissimilarity

Yi Jia et al. BMC Bioinformatics.

Abstract

Background: Automatic identification of structure fingerprints from a group of diverse protein structures is challenging, especially for proteins whose divergent amino acid sequences may fall into the "twilight-" or "midnight-" zones where pair-wise sequence identities to known sequences fall below 25% and sequence-based functional annotations often fail.

Results: Here we report a novel graph database mining method and demonstrate its application to protein structure pattern identification and structure classification. The biologic motivation of our study is to recognize common structure patterns in "immunoevasins", proteins mediating virus evasion of host immune defense. Our experimental study, using both viral and non-viral proteins, demonstrates the efficiency and efficacy of the proposed method.

Conclusion: We present a theoretical framework, offer a practical software implementation for incorporating prior domain knowledge (such as the substitution matrices studied here), and devise an efficient algorithm to identify approximately matched frequent subgraphs. In doing so, we significantly expand the analytical power of sophisticated data mining algorithms in dealing with large volumes of complicated and noisy protein structure data. Without loss of generality, the choice of an appropriate compatibility matrix allows our method to be readily employed in domains where subgraph labels carry some uncertainty.
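To make the role of the compatibility matrix concrete, the following is a minimal Python sketch of label-tolerant matching. The abstract does not state the scoring rule, so the rule used here (a product of compatibilities, each normalized by the self-match entry, compared against the isomorphism threshold τ) is an assumption for illustration only. The matrix values and the names `match_score` and `is_approximate_match` are hypothetical, and the sketch scores a single given vertex mapping without checking edge topology or searching for mappings.

```python
from math import prod

# Hypothetical compatibility matrix: M[a][b] is an assumed probability
# that vertex label a may be substituted by label b.
M = {
    "a": {"a": 0.6, "b": 0.4, "c": 0.0},
    "b": {"a": 0.3, "b": 0.7, "c": 0.0},
    "c": {"a": 0.0, "b": 0.0, "c": 1.0},
}


def match_score(pattern_labels, target_labels, mapping, matrix=M):
    """Score one vertex mapping (pattern vertex -> target vertex).

    Assumed rule: product over mapped vertices of M[p][t] / M[p][p],
    so an exact label match contributes a factor of 1.
    """
    return prod(
        matrix[pattern_labels[u]][target_labels[v]]
        / matrix[pattern_labels[u]][pattern_labels[u]]
        for u, v in mapping.items()
    )


def is_approximate_match(pattern_labels, target_labels, mapping, tau, matrix=M):
    """Accept the mapping when its score reaches the threshold tau."""
    return match_score(pattern_labels, target_labels, mapping, matrix) >= tau


# Usage: pattern vertex 0 (label "a") is mapped onto a target vertex labeled "b".
pattern = {0: "a", 1: "b"}
target = {10: "b", 11: "b"}
mapping = {0: 10, 1: 11}
score = match_score(pattern, target, mapping)
```

Under this assumed rule, setting τ = 1 accepts only exact label matches, while lowering τ admits progressively more substitutions.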


Figures

Figure 1
3D structure and corresponding graph of one sample protein. Upper: One segment of the 3D structure of the 1FP5A Immunoglobulin C1-type protein (the paired Fcε 3 and 4 domains of IgE). Lower: The corresponding graph. Vertices are Cα atoms. Covalent edges are shown in heavy magenta, while non-covalent edges defined by Almost-Delaunay Tessellation (ε = 0.1) appear in thin blue.
Figure 2
Graph database and compatibility matrix. Example of a graph database D and a compatibility matrix M.
Figure 3
Example of frequent subgraphs and approximate frequent subgraphs. Given the graph database D in Figure 2 and the support threshold σ = 2/3, the left side shows the frequent subgraphs found by general exact graph mining. Given the compatibility matrix M in Figure 2, the isomorphism threshold τ = 0.4, and the support threshold σ = 2/3, the right side shows the frequent approximate subgraphs in D.
Figure 4
Approximate subgraph. A pattern T, a graph Q and approximate subgraph G' of Q.
Figure 5
The procedure of experimental research.
Figure 6
Number of patterns for Immunoglobulin C1 set acquired by APGM.
Figure 7
Distribution and significance of features among Immunoglobulin C1 Proteins. Upper: Distribution of frequent subgraph features among Immunoglobulin C1 proteins. Lower: Significance of frequent subgraph features among Immunoglobulin C1 proteins. Both figures are constructed for the set for classification. There are 202 patterns that are mined with the support threshold σ = 4.5 and the isomorphism threshold τ = 0.35.
Figure 8
Distribution and significance of features among Immunoglobulin V proteins. Upper: Distribution of frequent subgraph features among Immunoglobulin V proteins. Lower: Significance of frequent subgraph features among Immunoglobulin V proteins. Both figures are constructed for the set for classification. There are 160 patterns that are mined with the support threshold σ = 4.5 and the isomorphism threshold τ = 0.75.
Figure 9
Computational performance comparison. We compared the computational performance of APGM and MGM using synthetic data sets. APGM used isomorphism thresholds τ = 1.0, 0.8, 0.7, and 0.6. Given the number of patterns N and the running time t (s), rate = N/t.
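The two thresholds and the rate used in the captions above can be illustrated with a short Python sketch. It assumes that support is the fraction of database graphs containing at least one match whose score reaches τ, which is consistent with the fractional σ = 2/3 in Figure 3 but is not spelled out on this page. The function names and the per-graph scores are hypothetical.

```python
def support(best_scores, tau):
    """Fraction of database graphs whose best match score reaches tau.

    best_scores holds one value per graph: the best score of the pattern
    in that graph (0.0 when the pattern does not occur at all).
    """
    if not best_scores:
        return 0.0
    hits = sum(1 for s in best_scores if s >= tau)
    return hits / len(best_scores)


def is_frequent(best_scores, tau, sigma):
    """A pattern is kept when its support reaches the threshold sigma."""
    return support(best_scores, tau) >= sigma


def mining_rate(num_patterns, seconds):
    """rate = N / t, as defined in the Figure 9 caption."""
    return num_patterns / seconds


# Usage with made-up scores for a database of three graphs.
scores = [1.0, 0.5, 0.2]
frequent = is_frequent(scores, tau=0.4, sigma=2 / 3)
```

With these made-up scores, two of the three graphs reach τ = 0.4, so the pattern meets σ = 2/3; raising τ to 0.6 leaves only one supporting graph and the pattern is dropped.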
