Protein Sci. 2002 Dec;11(12):2836-47.
doi: 10.1110/ps.0207402.

Sequence conserved for subcellular localization

Rajesh Nair et al. Protein Sci. 2002 Dec.

Abstract

The more proteins diverge in sequence, the more difficult it becomes for bioinformatics to infer similarities of protein function and structure from sequence. The precise thresholds used in automated genome annotations depend on the particular aspect of protein function transferred by homology. Here, we presented the first large-scale analysis of the relation between sequence similarity and identity in subcellular localization. Three results stood out: (1) The subcellular compartment is generally more conserved than might have been expected given that short sequence motifs like nuclear localization signals can alter the native compartment; (2) the sequence conservation of localization is similar between different compartments; and (3) it is similar to the conservation of structure and enzymatic activity. In particular, we found the transition between the regions of conserved and nonconserved localization to be very sharp, although the thresholds for conservation were less well defined than for structure and enzymatic activity. We found that a simple measure for sequence similarity accounting for pairwise sequence identity and alignment length, the HSSP distance, distinguished accurately between protein pairs of identical and different localizations. In fact, BLAST expectation values outperformed the HSSP distance only for alignments in the subtwilight zone. We succeeded in slightly improving the accuracy of inferring localization through homology by fine-tuning the thresholds. Finally, we applied our results to the entire SWISS-PROT database and five entirely sequenced eukaryotes.

Figures

Fig. 1.
Transition from safe over twilight to midnight zone of protein comparisons. Alignment methods maximize the sequence similarity between two proteins. When we want to translate these levels of sequence similarity to conclusions about similarity in structure/function, we can distinguish three major regions; the boundaries between these are not well defined. (1) Safe zone: All protein pairs in this region have similar structure/function, that is, sequence similarity implies similarity in structure/function. (2) Twilight zone: Most pairs in this region have similar structure/function. (3) Midnight zone: Whereas many of the pairs in this region may have similar structure/function, most do not. The curves illustrate accuracy (or specificity, black line) and coverage (or selectivity, grey line); the x-axis gives the pairwise sequence similarity, the y-axis the percentage of pairs that are similar above the given threshold (accuracy) and the percentage of similar pairs that are found above the given threshold (coverage). These sketched curves point out that there is a trade-off between accuracy and coverage; whereas the safe zone is defined by 100% accuracy, we typically find only a few of the pairs with similar structure/function in this region of sequence similarity (low coverage). At the other extreme, we find many pairs of similar structure/function in the midnight zone (high coverage). However, the accuracy is very low. Obviously, the choice of appropriate thresholds constitutes a balance between the Scylla of 100% accuracy with no homolog found and the Charybdis of many putative homologs found, most of which are not homologous. The particular shape of the curves that describe accuracy and coverage depends on the problem at hand, that is, on the particular feature of biological similarity that we try to infer (Fig. 6 compares the transition for a variety of features).
Here, we focus on the problem of establishing thresholds that allow inferring subcellular localization through sequence similarity.
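
The accuracy and coverage described in this caption can be sketched in a few lines of Python. This is only an illustration that follows the caption's wording (pairs at or above a similarity threshold); the paper's Eqs. 3 and 4 are not reproduced in this excerpt, and the function name, the tuple layout, and the toy pairs are hypothetical.

```python
from typing import List, Tuple


def cumulative_accuracy_coverage(
    pairs: List[Tuple[float, bool]], threshold: float
) -> Tuple[float, float]:
    """Return (accuracy %, coverage %) at a similarity threshold.

    pairs: (similarity score, True if both proteins share a localization).
    accuracy: share of pairs at or above the threshold that are true pairs.
    coverage: share of all true pairs that lie at or above the threshold.
    """
    total_true = sum(1 for _, same in pairs if same)
    above = [same for score, same in pairs if score >= threshold]
    true_above = sum(above)
    accuracy = 100.0 * true_above / len(above) if above else float("nan")
    coverage = 100.0 * true_above / total_true if total_true else float("nan")
    return accuracy, coverage


# Hypothetical toy data, not taken from the paper.
toy_pairs = [(10, True), (8, True), (5, False), (3, True), (-2, False), (-5, True)]
strict = cumulative_accuracy_coverage(toy_pairs, threshold=8)
loose = cumulative_accuracy_coverage(toy_pairs, threshold=-5)
print(strict, loose)
```

With the toy data, the strict threshold keeps only true pairs but finds half of them, whereas the loose threshold finds all true pairs at lower accuracy, which is the trade-off the sketched curves illustrate.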
Fig. 2.
Sequence conservation for major classes of subcellular localization. For different thresholds in terms of the HSSP distance (Eq. 1), we compiled the levels of cumulative accuracy (Eq. 3) and cumulative coverage (Eq. 4). The major compartments had very similar curves for cumulative accuracy. The transition from the safe zone to the twilight zone occurred around HSSP distances of 4. In contrast to the perfect conservation of structure, the cumulative accuracy (A) was observed to be as low as 80% (for mitochondrial proteins) in the safe zone. The cumulative coverage (B) showed greater variation among the different compartments; the transition for coverage occurred between HSSP distance 5 and −5. The coverage remained significantly low even at very low levels of accuracy.
Fig. 3.
Average conservation of subcellular localization. Graphs A, B, C show the performance of pairwise BLAST searches for the biased set, whereas graphs D, E, F show the performance of pairwise BLAST and PSI-BLAST searches on the sequence-unique subset. The filled symbols show cumulative accuracy and cumulative coverage (Eqs. 3 and 4) for pairwise BLAST; open symbols give the results from PSI-BLAST searches. For the biased set, a cumulative coverage of 1% corresponds to the identification of ∼274K pairs from identical localization (true pairs), whereas for the sequence-unique subset, a cumulative coverage of 1% corresponds to the identification of ∼21K true pairs. Conservation thresholds for BLAST and PSI-BLAST are indicated by open and filled arrows, respectively. For HSSP distance (C,F), the conservation threshold using BLAST was at HSSP distance = 4 (open arrow) for the biased and sequence-unique sets, whereas by using PSI-BLAST, the conservation threshold was at HSSP distance = 0 (filled arrow) for the sequence-unique set. The cumulative accuracy and cumulative coverage when using BLAST for the sequence-unique set were 87% and 0.36%, respectively, and for PSI-BLAST, they were 91% and 0.4%, respectively. For the cumulative accuracy vs percent sequence identity graphs (A,D), no sharp conservation thresholds could be established. The percent sequence identity graphs showed the largest variation between the biased and sequence-unique sets. In contrast, the graphs for BLAST E-values (B,E) and HSSP distances (C,F, Eq. 1) were similar for the biased and the sequence-unique set. The conservation thresholds for PSI-BLAST occurred at a lower threshold than that for pairwise BLAST (D,E,F). The middle graphs plot the logarithm of the BLAST E-values (log to the base e). Note that BLAST E-values below 10^−200 did not suffice to safely infer localization. In contrast, at very high HSSP distances and sequence identities, localization could be reliably transferred.
Fig. 4.
Performance for different measures of sequence similarity. The black lines and open symbols show cumulative coverage vs cumulative accuracy for PSI-BLAST searches, whereas grey lines and shaded symbols show the same for pairwise BLAST (A,B). The figure plots data only for cumulative accuracy above 80%, which is well below the threshold for conservation of localization. (A) For HSSP distance (circles) and percent sequence identity (squares), PSI-BLAST vastly outperforms pairwise BLAST. However, using BLAST E-values, both BLAST and PSI-BLAST gave comparable performance at the conservation threshold (86% cumulative accuracy in figure). For both pairwise BLAST and PSI-BLAST, scoring the alignments using HSSP distance (Eq. 1) gave the best coverage vs accuracy graphs. Using HSSP distance for PSI-BLAST alignments gave the overall best performance. (B) For both pairwise BLAST and PSI-BLAST, using scaled distance (Eq. 2) from the HSSP curve improved performance compared with HSSP distance. The performance was worse when perpendicular distance from the HSSP curve was used. Overall, using PSI-BLAST alignments and scaled distance from the HSSP curve gave the best performance. The curves for cumulative accuracy and coverage for the scaled HSSP distance (C) were similar to those obtained for the standard HSSP distance (Fig. 3F).
Fig. 5.
Percentage pairwise sequence identity vs. length of alignment. The grey plus signs represent protein pairs experimentally observed in identical compartments, whereas the black squares represent pairs observed in different compartments. The grey line is the HSSP curve (Eq. 1) optimized to describe the sequence conservation of protein structure (Rost 1999). The HSSP curve was surprisingly accurate at reproducing the curve that may best separate proteins with identical localization from those of different localization.
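
For orientation, the HSSP curve and the resulting distance can be sketched as below. Eq. 1 is not reproduced in this excerpt, so the constants are an assumption: they follow the form of the curve as commonly quoted from Rost (1999) and should be checked against that paper before use. The function names are hypothetical.

```python
import math


def hssp_curve(length: int) -> float:
    """Percent-identity threshold for an alignment of the given length.

    Assumed form (after Rost 1999, not confirmed by this excerpt):
    100 for L <= 11, 480 * L^(-0.32 * (1 + exp(-L / 1000))) for
    11 < L <= 450, and 19.5 for L > 450.
    """
    if length <= 11:
        return 100.0
    if length <= 450:
        return 480.0 * length ** (-0.32 * (1.0 + math.exp(-length / 1000.0)))
    return 19.5


def hssp_distance(percent_identity: float, length: int) -> float:
    """Vertical distance of an alignment from the curve, in percentage points.

    Positive values lie above the curve, negative values below it.
    """
    return percent_identity - hssp_curve(length)


# A pair with 40% identity over 1000 aligned residues sits 20.5 points above the curve.
print(hssp_distance(40.0, 1000))
```

Because the threshold falls as alignments get longer, the same percent identity yields a larger distance for a long alignment than for a short one, which is how the measure accounts for alignment length.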
Fig. 6.
Conservation of function and structure. (A) We aligned all proteins in 30 entirely sequenced organisms with PSI-BLAST against all known proteins. We considered all pairs identified above PSI-BLAST expectation values of 10^−3 to constitute the respective family (100%). We plotted the percentage of proteins found at a given threshold for sequence similarity. Whether sequence similarity was measured by pairwise sequence identity (lower x-axis, thin line with triangles) or by PSI-BLAST expectation values (upper x-axis, thick line with crosses), the number of members of a group increased nonlinearly at some given threshold. (B,C) Sequence conservation of four different features of protein structure and function. The data for the conservation of protein structure (thick grey line with crossed boxes) were compiled according to Rost (1999). The data for the conservation of enzymatic activity were compiled according to Rost (2002). We identified similarity in enzymatic activity by the identity of the first EC digit distinguishing six classes (oxidoreductases, transferases, hydrolases, lyases, isomerases, and ligases; thin grey line with triangles), and by the identity of the detailed activity (all four digits conserved, thin black lines with crosses). Finally, we used the data set of subcellular localization explored in this study. Sequence similarity was measured by the HSSP distance (B) and by the BLAST expectation values (C). All comparisons were based on pairwise BLAST alignments.

References

    1. Abagyan, R.A. and Batalov, S. 1997. Do aligned sequences share the same fold? J. Mol. Biol. 273 355–368.
    2. Alexandrov, N.N. and Soloveyev, V.V. 1998. Statistical significance of ungapped sequence alignments. In HICCS '98: Pacific symposium on biocomputing '98. (eds. R.B. Altman, A.K. Dunker, L. Hunter, and T.E. Klein), pp. 463–472. World Scientific, Maui, Hawaii.
    3. Altschul, S.F. 1993. A protein alignment scoring system sensitive at all evolutionary distances. J. Mol. Evol. 36 290–300.
    4. Altschul, S.F. and Gish, W. 1996. Local alignment statistics. Meth. Enzymol. 266 460–480.
    5. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215 403–410.