J Mol Biol. 2002 Oct 25;323(3):551-62.
doi: 10.1016/s0022-2836(02)00971-3.

Sequence variations within protein families are linearly related to structural variations

Patrice Koehl et al. J Mol Biol.

Abstract

It is commonly believed that similarities between the sequences of two proteins imply similarities between their structures. Sequence alignments reliably recognize pairs of proteins with similar structures, provided that the percentage sequence identity between the two sequences is sufficiently high. This recognition, however, is statistically less reliable when the percentage sequence identity is lower than 30%, and little is then known about the detailed relationship between the two measures of similarity. Here, we investigate the inverse correlation between structural similarity and sequence similarity on 12 protein structure families. We define the structure similarity between two proteins as the cRMS (coordinate root-mean-square) distance between their structures. The sequence similarity for a pair of proteins is measured as the mean distance between the sequences in the subsets of sequence space compatible with their structures. We obtain an approximation of the sequence space compatible with a protein by designing a collection of protein sequences that are both stable and specific to the structure of that protein. Using these measures of sequence and structure similarity, we find that structural changes within a protein family are linearly related to changes in sequence similarity.
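The cRMS distance used above as the structural measure can be illustrated with a minimal Python sketch. It assumes the two structures have already been structurally aligned and optimally superimposed, so that the i-th point of one list is equivalent to the i-th point of the other; the function name `crms` and the coordinate format are illustrative choices, not taken from the paper.

```python
import math


def crms(coords_a, coords_b):
    """Coordinate RMS distance between two equal-length lists of (x, y, z) points.

    Assumption: the points are already paired by a structural alignment and
    optimally superimposed. No rotation or translation is applied here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    # Sum of squared distances between equivalent atoms.
    total = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))
```

In practice the superposition step (for example a least-squares fit over the aligned Cα atoms) dominates the work; this sketch only covers the final averaging.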

Figures

Figure 1
Comparison of the folds of protein G, the B1 domain of streptococcal protein G (PDB code 1pgb); protein L, the B1 domain of P. magnus (PDB code 2ptl); and protein ROP, a transcription regulator of E. coli (PDB code 1rop). Only the last 61 residues of protein L are shown, since the first 17 residues have no equivalent in protein G. The drawings of the proteins were generated using MOLSCRIPT.
Figure 2
Comparing the subsets of sequence space compatible with protein G and protein L. The native sequences of protein G and protein L have no detectable similarities, while their structures are very similar (cRMS = 1.9 Å). Two subsets of 100 sequences are designed for these two proteins using the SSA procedure described in Methods. The sequences in each subset have the same amino acid composition as the corresponding native sequence. The mean sequence identity 〈IS〉 and the mean sequence similarity 〈S50〉, based on the Blosum 50 matrix, between these two subsets are plotted as a function of the cycle of the SSA design procedure. The vertical bars show the standard deviations of the corresponding distributions. Monotonic increases of both measures are observed, indicating a convergence in the sequence information contained in the two proteins. For comparison, the percentage sequence identity and similarity score between the native sequences of proteins G and L are shown as discontinuous lines.
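A mean identity between two sets of designed sequences, with its standard deviation, could be computed along the following lines. This is a minimal sketch: it assumes every sequence has already been mapped onto a common structural alignment, so all strings have the same length and position i is equivalent across them. The function names are hypothetical, and the Blosum 50 similarity 〈S50〉 is not computed here.

```python
from itertools import product
from statistics import mean, pstdev


def percent_identity(seq_a, seq_b):
    """Percentage of aligned positions carrying the same residue."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


def cross_set_identity(set_a, set_b):
    """Mean and population standard deviation of the percentage identity
    over all pairs (a, b) with a taken from set_a and b from set_b."""
    scores = [percent_identity(a, b) for a, b in product(set_a, set_b)]
    return mean(scores), pstdev(scores)
```

For two subsets of 100 sequences each, this averages over 10,000 cross-set pairs. Repeating the calculation at each design cycle would give one point and one error bar per cycle.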
Figure 3
Comparing the subsets of sequence space compatible with protein G and protein ROP. Protein G and protein ROP have no detectable sequence or structure similarities. Two subsets of 100 sequences are designed for these two proteins using the SSA procedure described in Methods. The mean sequence identity 〈IS〉 and the mean sequence similarity 〈S50〉, based on the Blosum 50 matrix, between these two subsets are plotted as a function of the cycle of the SSA design procedure. No convergence in sequence space is observed.
Figure 4
Five SH3 domains. The drawings of the proteins were generated using MOLSCRIPT.
Figure 5
Comparing the subsets of sequence space compatible with the five SH3 domains shown in Figure 4. Five subsets of 100 sequences are designed for the five domains, using the SSA procedure. The mean sequence similarities 〈S50〉 between the set generated for the SH3 domain of spectrin, 1SHG, and all four other subsets are plotted versus the cycle of SSA. The mean similarity score is found to be higher when the cRMS between the two structures is low (i.e. for the two pairs 1SHG-1SHF and 1SHG-1BU1), and conversely lower when the cRMS is high (i.e. for the two pairs 1SHG-1IHV and 1SHG-1BYM).
Figure 6
The sequence similarity between two SH3 domains is correlated with their structure similarity. The mean sequence identity 〈IS〉 (a) and the mean sequence similarity 〈S50〉 (b) between the converged sets of sequences designed for two SH3 domains are plotted versus the cRMS distance between the structures of these domains. Structural alignments were generated using STRUCTAL.
Figure 7
The relationship between sequence similarity and structure similarity is studied over 14 protein structure families (see Table 1). (a) For each pair of proteins in each family, the percentage sequence identity between their native sequences, ISNAT, is plotted versus the cRMS distance between their structures. A least-squares fit to the data gives the relationship cRMS = 4.02 exp(−0.0222 ISNAT). The continuous line shows the fit to the data, while the discontinuous line shows the relationship between ISNAT and cRMS predicted from equation (1), derived from Chothia & Lesk. (b) The sequence of each protein is given as input to a FASTA search for sequence similarity over the database of protein sequences derived from the PDB. For each pair of proteins considered in (a), the raw FASTA score is plotted versus the cRMS distance. A non-linear relationship between the FASTA score and cRMS is observed. (c) Sets of 100 sequences are designed for all proteins in these structure families. For each pair of proteins, the mean percentage sequence identity 〈IS〉 between the converged sets of sequences designed based on their structures is plotted versus the cRMS distance. A linear relationship is observed, with a correlation coefficient R = 0.80. The continuous line shows the fit to the data.
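The two fits mentioned in this caption can be sketched in Python. The first function simply evaluates the exponential relationship quoted for panel (a), using the constants 4.02 and 0.0222 from the caption. The second is a generic least-squares line with a Pearson correlation coefficient, of the kind reported for panel (c). The function names are hypothetical, and no data from the paper is reproduced.

```python
import math


def crms_from_native_identity(is_nat):
    """Fitted curve quoted in the caption: cRMS = 4.02 exp(-0.0222 * IS_NAT),
    with IS_NAT the native percentage sequence identity (0 to 100)."""
    return 4.02 * math.exp(-0.0222 * is_nat)


def linear_fit(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept, plus Pearson R."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("both samples must have non-zero variance")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r
```

Under the fitted curve, a pair with no identical residues (ISNAT = 0) maps to a cRMS of 4.02 Å, and the predicted cRMS falls steadily as identity rises. To mirror panel (c), `linear_fit` would take the cRMS values and the corresponding 〈IS〉 values for all protein pairs as its two inputs.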
