BMC Genomics. 2007 Jun 26;8:191.
doi: 10.1186/1471-2164-8-191.

C-terminal motif prediction in eukaryotic proteomes using comparative genomics and statistical over-representation across protein families

Ryan S Austin et al. BMC Genomics.

Abstract

Background: The carboxy termini of proteins are a frequent site of activity for a variety of biologically important functions, ranging from post-translational modification to protein targeting. Several short peptide motifs that are involved in protein sorting and depend on their proximity to the C-terminus for proper function have already been characterized. As only a limited number of such motifs have been identified, the potential exists for genome-wide statistical analysis and comparative genomics to reveal novel peptide signatures functioning in a C-terminal-dependent manner. We have applied a novel methodology, involving a simple z-statistic and several techniques for improving the signal-to-noise ratio, to the prediction of C-terminally anchored peptide motifs.

Results: We examined the statistical over-representation of position-specific C-terminal tripeptides in 7 eukaryotic proteomes. Sequence randomization models and simple-sequence masking were applied and successfully reduced background noise. Because C-terminal homology among members of large protein families can artificially inflate tripeptide counts, gene-family clustering was performed before the analysis so that tripeptide over-representation was assessed across protein families rather than across all proteins. Finally, comparative genomics was used to identify tripeptides that were significant in multiple species. This approach has been able to predict, to our knowledge, all C-terminally anchored targeting motifs present in the literature. These include the PTS1 peroxisomal targeting signal (SKL*), the ER-retention signal (K/HDEL*), the ER-retrieval signal for membrane-bound proteins (KKxx*), the prenylation signal (CC*) and the CaaX box prenylation motif. In addition to a high statistical over-representation of these known motifs, a collection of significant tripeptides with a high propensity for biological function is shared between species, among kingdoms and across eukaryotes. Motifs of note include a serine-acidic peptide (DSD*) as well as several lysine-enriched motifs found in nearly all eukaryotic genomes examined.
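
The counting and scoring step described in the Results can be illustrated with a short Python sketch. It tallies position-specific C-terminal tripeptides and scores an observed count against a simple binomial null. The null model (product of proteome-wide residue frequencies), the function names and the `offset` parameter are assumptions made for illustration; the abstract does not give the exact form of the z-statistic or of the randomization models.

```python
from collections import Counter
from math import sqrt


def cterm_tripeptide_counts(proteins, offset=3):
    """Count the tripeptide starting `offset` residues from the C-terminus.

    offset=3 takes the last three residues (position -3);
    offset=4 takes the window shifted one residue inward (position -4).
    """
    counts = Counter()
    for seq in proteins:
        if len(seq) >= offset:
            start = len(seq) - offset
            counts[seq[start:start + 3]] += 1
    return counts


def background_prob(tripeptide, proteins):
    """Assumed null model: product of proteome-wide residue frequencies."""
    residues = Counter("".join(proteins))
    total = sum(residues.values())
    p = 1.0
    for aa in tripeptide:
        p *= residues[aa] / total
    return p


def binomial_z(observed, n, p):
    """z-score of an observed count against a binomial(n, p) expectation."""
    expected = n * p
    sd = sqrt(n * p * (1.0 - p))
    return (observed - expected) / sd if sd > 0 else 0.0


def tripeptide_z(tripeptide, proteins, offset=3):
    """z-score for one tripeptide at one C-terminal position."""
    counts = cterm_tripeptide_counts(proteins, offset)
    n = sum(counts.values())  # number of proteins long enough to be scored
    p = background_prob(tripeptide, proteins)
    return binomial_z(counts[tripeptide], n, p)
```

Calling `tripeptide_z("SKL", proteins)` on a list of protein sequences returns the score for the extreme C-terminal position. A shuffled-sequence control, as mentioned in the Results, could be scored with the same functions to estimate how many high z-values arise by chance.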

Conclusion: We have successfully generated a high-confidence representation of eukaryotic motifs anchored at the C-terminus. A high incidence of true positives in our results suggests that several previously unidentified tripeptide patterns are strong candidates for novel peptide motifs widely employed in the C-terminal biology of eukaryotes. Our application of comparative genomics, statistical over-representation and the adjustment for protein family homology has generated several hypotheses concerning the C-terminal topology as it pertains to sorting and potential protein interaction signals. This approach to background reduction could be extended to protein motif prediction in the protein interior. A parallel N-terminal analysis is presented as supplementary data.


Figures

Figure 1
Flowchart of the SOCT (statistically over-represented C-terminal tripeptide) pipeline. A combination of filters and pre-processing steps was applied to individual proteomes to obtain a comprehensive set of z-statistics for each possible tripeptide at all positions from the C-terminal end to 100 residues in from the C-terminus. Programs and scripts for data analysis are represented as barred boxes, while resulting datasets are depicted as polygons.
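
One of the pre-processing steps, the protein family adjustment, can be sketched as follows. This is a minimal illustration that assumes each tripeptide is counted at most once per family; the abstract does not state how the authors combined family members, and `family_of` is a hypothetical mapping standing in for the output of a clustering step.

```python
from collections import defaultdict


def family_level_counts(sequences, family_of):
    """Count, per C-terminal tripeptide, the distinct families carrying it.

    sequences: dict of protein_id -> amino-acid sequence
    family_of: dict of protein_id -> family label (hypothetical clustering
               output); unlabeled proteins are treated as singleton families.
    """
    families = defaultdict(set)
    for pid, seq in sequences.items():
        if len(seq) < 3:
            continue  # too short to carry a tripeptide
        families[seq[-3:]].add(family_of.get(pid, pid))
    return {tri: len(fams) for tri, fams in families.items()}
```

With this counting, two paralogs that share the same C-terminal tripeptide contribute a single observation, so a large family cannot dominate the tally on its own.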
Figure 2
Position-specific abundance of SOCTs in A. thaliana. Graphical depictions of the number of statistically over-represented C-terminal tripeptides (z ≥ 3) occurring in the C-terminal region (-3 to -100). A. The unfiltered assessment of statistical over-representation in the C-terminus, as compared to a randomized data set control. B. The reduction in site-specific SOCT abundance after successive rounds of filtering measures including sequence masking, protein family adjustment and the stipulation of at least 10 occurrences for each SOCT.
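
The selection criteria named in this caption (z ≥ 3 and at least 10 occurrences) amount to a simple filter. The sketch below assumes a hypothetical `stats` dictionary keyed by (position, tripeptide); the data layout and function name are illustrative and not taken from the authors' scripts.

```python
def select_socts(stats, z_min=3.0, min_count=10):
    """Keep (position, tripeptide) keys passing both thresholds.

    stats: dict of (position, tripeptide) -> (z_score, occurrence_count)
    """
    return {
        key
        for key, (z, count) in stats.items()
        if z >= z_min and count >= min_count
    }
```

Applying the count threshold together with the z threshold removes tripeptides whose high score rests on only a handful of proteins.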
Figure 3
SOCT intersections between species. Intersections of statistically over-represented tripeptides at the C-terminus of (A) the two plant species (A. thaliana, O. sativa), (B) the two lower animals (C. elegans, D. melanogaster) and (C) the two mammalian proteomes (H. sapiens, M. musculus). The SOCT abundance at each C-terminal position is graphed for each species, with the number of SOCTs common to both species depicted with blue boxes.
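
The per-position intersection plotted in this figure can be expressed as a set operation. The sketch below assumes each species' SOCTs are stored as a set of (position, tripeptide) pairs; that representation and the function name are illustrative choices.

```python
from collections import Counter


def shared_by_position(socts_a, socts_b):
    """Count SOCTs common to two species at each C-terminal position.

    Each input is a set of (position, tripeptide) pairs, e.g. (-3, "SKL").
    A tripeptide is shared only if it is significant at the same position
    in both species.
    """
    shared = socts_a & socts_b
    return Counter(pos for pos, _ in shared)
```

Because the position is part of each pair, a tripeptide that is significant at -3 in one species and at -4 in the other is not counted as shared.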
Figure 4
Heatmap of SOCTs intersected across all genomes examined. SOCTs present in at least two species and occurring in at least 10 genes in each proteome are represented in two blocks of heatmapped z-scores. Positions for the extreme terminal end (-3) and one position in (-4) are shown on the left and right, respectively. SOCTs of interest are sorted row-wise in increasing significance, with columns listing the species. Tripeptides matching characterized consensus sequences are highlighted. Generated with Heatmapper [55].

References

    1. Chung JJ, Shikano S, Hanyu Y, Li M. Functional diversity of protein C-termini: more than zipcoding? Trends Cell Biol. 2002;12:146–150. doi: 10.1016/S0962-8924(01)02241-3.
    2. Zhang FL, Casey PJ. Protein prenylation: molecular mechanisms and functional consequences. Annu Rev Biochem. 1996;65:241–269. doi: 10.1146/annurev.bi.65.070196.001325.
    3. Gould SJ, Collins CS. Opinion: peroxisomal-protein import: is it really that complex? Nat Rev Mol Cell Biol. 2002;3:382–389. doi: 10.1038/nrm807.
    4. Teasdale RD, Jackson MR. Signal-mediated sorting of membrane proteins between the endoplasmic reticulum and the golgi apparatus. Annu Rev Cell Dev Biol. 1996;12:27–54. doi: 10.1146/annurev.cellbio.12.1.27.
    5. Mullen RT, Lee MS, Flynn CR, Trelease RN. Diverse amino acid residues function within the type 1 peroxisomal targeting signal. Implications for the role of accessory residues upstream of the type 1 peroxisomal targeting signal. Plant Physiol. 1997;115:881–889. doi: 10.1104/pp.115.3.881.
