Protein Sci. 2003 Aug;12(8):1604-12.
doi: 10.1110/ps.0379403.

Exploring the nonlinear geometry of protein homology

Michael A Farnum et al. Protein Sci. 2003 Aug.

Abstract

The explosion of biological data resulting from genomic and proteomic research has created a pressing need for data analysis techniques that work effectively on a large scale. An area of particular interest is the organization and visualization of large families of protein sequences. An increasingly popular approach is to embed the sequences into a low-dimensional Euclidean space in a way that preserves some predefined measure of sequence similarity. This method has been shown to produce maps that exhibit global order and continuity and reveal important evolutionary, structural, and functional relationships between the embedded proteins. However, protein sequences are related by evolutionary pathways that exhibit highly nonlinear geometry, which is invisible to classical embedding procedures such as multidimensional scaling (MDS) and nonlinear mapping (NLM). Here, we describe the use of stochastic proximity embedding (SPE) for producing Euclidean maps that preserve the intrinsic dimensionality and metric structure of the data. SPE extends previous approaches in two important ways: (1) It preserves only local relationships between closely related sequences, thus allowing the map to unfold and reveal its intrinsic dimension, and (2) it scales linearly with the number of sequences and therefore can be applied to very large protein families. The merits of the algorithm are illustrated using examples from the protein kinase and nuclear hormone receptor superfamilies.
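
The two properties listed in the abstract (only local distances are enforced, and the work grows linearly with the number of sequences) can be illustrated with a minimal Python sketch of an SPE-style pairwise refinement loop. This is not the authors' implementation. The update rule follows the commonly described form of SPE, and the function name `spe_embed`, its parameters (`rc`, `n_cycles`, `steps_per_cycle`, the learning-rate schedule), and the toy distance matrix are all assumptions made for illustration.

```python
import math
import random


def spe_embed(dist, n_points, dim=2, rc=math.inf, n_cycles=50,
              steps_per_cycle=None, lam_start=1.0, lam_end=0.01,
              eps=1e-10, seed=0):
    """Sketch of an SPE-style embedding (hypothetical helper, not the paper's code).

    dist(i, j) returns the input dissimilarity r_ij between items i and j.
    rc is the neighborhood radius: pairs farther apart than rc are only
    pushed apart when the map places them too close together.
    """
    rng = random.Random(seed)
    coords = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    if steps_per_cycle is None:
        # Assumption: a number of pair updates proportional to N, so the
        # total work grows linearly with the number of items.
        steps_per_cycle = 100 * n_points
    lam = lam_start
    lam_step = (lam_start - lam_end) / max(n_cycles - 1, 1)
    for _ in range(n_cycles):
        for _ in range(steps_per_cycle):
            i, j = rng.sample(range(n_points), 2)  # random pair, i != j
            r = dist(i, j)                         # target distance
            d = math.dist(coords[i], coords[j])    # current map distance
            # Enforce local distances; treat non-local ones as lower bounds.
            if r <= rc or d < r:
                scale = lam * 0.5 * (r - d) / (d + eps)
                for k in range(dim):
                    delta = scale * (coords[i][k] - coords[j][k])
                    coords[i][k] += delta
                    coords[j][k] -= delta
        lam -= lam_step  # anneal the learning rate linearly
    return coords


# Toy input: three items whose distances form a 3-4-5 triangle.
triangle = [[0.0, 3.0, 4.0],
            [3.0, 0.0, 5.0],
            [4.0, 5.0, 0.0]]
coords = spe_embed(lambda i, j: triangle[i][j], n_points=3)
```

Because each step touches a single random pair, the full distance matrix never has to be held in memory. In a sequence setting, `dist` would be backed by whatever alignment-based dissimilarity is in use.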

Figures

Figure 1.
Two-dimensional stochastic proximity embedding of (A) the kinase domains identified and classified by Hanks and Hunter (1995), and (B) the kinase domains identified and classified by Manning et al. (2002). Both maps were constructed using a pairwise distance measure based on a multiple sequence alignment and the PAM250 amino acid exchange matrix.
Figure 2.
Stress and number of connected components of the 2D SPE map of the Manning kinase superfamily as a function of the neighborhood radius, rc. For a well-sampled noiseless manifold embedded in the intrinsic dimension, the ideal cutoff is any value that leads to zero stress and a single connected component. For sparsely sampled data sets that contain discontinuities (such as the ones examined here), no such value exists, and the “ideal” cutoff is one that represents a good compromise between the stress and the number of connected components, and leads to a visually meaningful map. This value is typically located near the point where the two normalized curves intersect.
Figure 3.
Two-dimensional SPE maps of the Manning kinase domains using a neighborhood radius of (A) 0.87, (B) 0.89, and (C) 0.91. As the cutoff decreases, distinct families that are not discernible in the conventional nonlinear map (Fig. 1B) begin to emerge and become more clearly delineated until we reach the fragmentation threshold. At that point, the manifold breaks down into a large number of disconnected fragments and singletons, and the map loses its structure and interpretability.
Figure 4.
Stress and number of connected components of the 2D SPE map of the CMGC subfamily of the Manning kinase domains as a function of the neighborhood radius, rc. The embeddings were based on the same multiple sequence alignment and pairwise similarity scores that were used to construct the maps in Figures 1B and 3. Because fewer and more closely related sequences are embedded, the neighborhood radius that reveals the internal structure of this cluster is smaller than that determined for the entire superfamily.
Figure 5.
Two-dimensional SPE maps of the CMGC subfamily of the Manning kinase domains using a neighborhood radius of (A) 0.89, and (B) 0.87. Subtle structure within subfamilies is obscured by the presence of distant sequences (A) and is only discernible when analyzed independently (B).
Figure 6.
Stress and number of connected components of the 2D SPE map of the NHR ligand-binding domains as a function of the neighborhood radius, rc.
Figure 7.
Two-dimensional SPE maps of the NHR ligand-binding domains using a neighborhood radius of (A) rc = ∞, and (B) rc = 0.62.
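
The captions for Figures 2, 4, and 6 track the number of connected components as the neighborhood radius rc varies. A minimal Python sketch of that diagnostic follows, under the assumption that two items are linked whenever their input distance is at most rc. The page does not state how the authors built the graph, so the function name `count_components`, the linking rule, and the toy distances are hypothetical.

```python
def count_components(dist_matrix, rc):
    """Count connected components of the graph linking pairs with r_ij <= rc.

    Hypothetical helper using union-find; it scans all pairs, so it is
    quadratic in the number of items and intended only as a diagnostic.
    """
    n = len(dist_matrix)
    parent = list(range(n))

    def find(a):
        # Follow parent links to the root, compressing the path on the way.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for i in range(n):
        for j in range(i + 1, n):
            if dist_matrix[i][j] <= rc:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_i] = root_j  # merge the two components
    return len({find(i) for i in range(n)})


# Toy input: four items at positions 0, 1, 2, and 10 on a line.
positions = [0.0, 1.0, 2.0, 10.0]
toy = [[abs(a - b) for b in positions] for a in positions]
components_by_rc = {rc: count_components(toy, rc) for rc in (0.5, 1.0, 8.0)}
```

In this toy case a small radius leaves every item isolated, an intermediate radius joins the three nearby items and leaves the distant one as a singleton, and a large radius yields a single component. This mirrors the fragmentation behavior described for small rc in the Figure 3 caption.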
