Bioinform Biol Insights. 2009 Nov 24:1:101-26.
doi: 10.4137/bbi.s415.

Peptide vocabulary analysis reveals ultra-conservation and homonymity in protein sequences

Derek Gatherer. Bioinform Biol Insights.

Abstract

A new algorithm is presented for vocabulary analysis (word detection) in texts of human origin. It performs at 60%-70% overall accuracy and greater than 80% accuracy for longer words, and approximately 85% sensitivity on Alice in Wonderland, a considerable improvement on previous methods. When applied to protein sequences, it detects short sequences analogous to words in human texts, i.e. intolerant to changes in spelling (mutation), and relatively context-independent in their meaning (function). Some of these are homonyms of up to 7 amino acids, which can assume different structures in different proteins. Others are ultra-conserved stretches of up to 18 amino acids within proteins of less than 40% overall identity, reflecting extreme constraint or convergent evolution. Different species are found to have qualitatively different major peptide vocabularies, e.g. some are dominated by large gene families, while others are rich in simple repeats or dominated by internally repetitive proteins. This suggests the possibility of a peptide vocabulary signature, analogous to genome signatures in DNA. Homonyms may be useful in detecting convergent evolution and positive selection in protein evolution. Ultra-conserved words may be useful in identifying structures intolerant to substitution over long periods of evolutionary time.

Keywords: bioinformatics; gene families; genome signature; motif; peptide conservation; peptide homonymity; peptide vocabulary; protein structure; vocabulary analysis; word detection.
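
The idea of a peptide "word" shared between otherwise dissimilar proteins can be illustrated with a small sketch. This is not the algorithm presented in the paper, whose details are not given in the abstract; it is a plain exact-match scan, and the function name `shared_peptides`, the parameter `min_proteins`, and the toy sequences are all assumptions made for illustration. Only the 7-residue string SLGDRVT is taken from the figure captions.

```python
from collections import defaultdict

def shared_peptides(proteins, k, min_proteins=2):
    """Map each length-k peptide to the set of protein IDs containing it,
    keeping only peptides found in at least `min_proteins` proteins."""
    seen = defaultdict(set)
    for name, seq in proteins.items():
        # Slide a window of width k along the sequence.
        for i in range(len(seq) - k + 1):
            seen[seq[i:i + k]].add(name)
    return {pep: ids for pep, ids in seen.items() if len(ids) >= min_proteins}

# Toy sequences, invented for illustration; two of them embed SLGDRVT.
toy = {
    "protA": "MKTSLGDRVTAAQ",
    "protB": "GGSLGDRVTWYP",
    "protC": "MLLAVPQERT",
}
hits = shared_peptides(toy, k=7)
```

A scan like this only finds exact recurrences. Deciding whether a shared peptide is a homonym (different structures) or an ultra-conserved word (same structure in divergent proteins) would still require structural comparison of the kind shown in the figures.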

Figures

Figure 1.
RS-ESM performed on Alice in Wonderland. Number of hits plotted against k. TP: true positives. Sen2 is also plotted (×100) to show its improvement at higher levels of k.
Figure 2.
Number of candidate DWoPs (×1000, kword) plotted against length of text (kchar) for RS-ESM, k = 6–18. Alice: Alice in Wonderland, Gulliver: Gulliver’s Travels, Oliver: Oliver Twist, Chaucer: Canterbury Tales in 19th century translation, Origin: Origin of Species, Don Quixote: 19th century English translation of same, KJB: King James Bible.
Figure 3.
NRL3D_63 (solid lines) and its shuffled equivalent (dotted lines), tested with both RS-ESM (squares) and CW-ESM (triangles). The logarithm of the number of hits, n, is plotted against k. Log(0) is arbitrarily designated zero. Pseudocounts are therefore added when n = 1.
Figure 4.
The first 150 residues of the alignment of the four sequences containing the word LGGTCVNVGCVPKK (shaded). The proteins are identified by their PDB designations. 1GRT: human glutathione reductase; 1GER: E. coli glutathione reductase; 1BZL: Trypanosoma cruzi trypanothione reductase; 1TYP: Crithidia fasciculata trypanothione reductase.
Figure 5.
Superposition of sequence LGGTCVNVGCVPKK in the 4 proteins aligned in Figure 4. The helical backbone is shown in black. Despite the variability of the other parts of these proteins, LGGTCVNVGCVPKK represents a region of extreme structural, and presumably functional, conservation between E. coli, humans and trypanosomes.
Figure 6.
The first 150 residues of the alignment of the three sequences containing the word AFLGIPFAEPPVG (shaded). 1EVE: Torpedo californica acetylcholinesterase; 1MAAD: mouse acetylcholinesterase chain D; 1LPM: Candida rugosa lipase.
Figure 7.
Superposition of SLGDRVT in mouse antibody proteins 1JRHL (white) and 1NMCL (black). Backbone traces are rendered as fine lines.
Figure 8.
Comparison of SLGDRVT word in Streptomyces albus lactamase 1BSG (white) and mouse antibody protein 1NMCL (black). Backbone traces are rendered as fine lines.
Figure 9.
Candidate words of k = 6 to 18, detected using RS-ESM, against number of proteins for 35 eubacterial (black circles) and 28 archaeal (grey circles) proteomes.
Figure 10.
N-terminal region of 3 C. muridarum transporter proteins showing the GPNGAGKSTL word (shaded).
Figure 11.
HHRIKNNLQVISSLLDL (shaded) in part of a histidine kinase alignment.
Figure 12.
VLVIGA in (from left to right) 1D4O.A, 1BFD and 1AD3.A. The backbone trace is drawn as a black line.
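
The plotting convention in the Figure 3 caption (log of hit count against k, with log(0) drawn as zero and a pseudocount at n = 1) and the use of a shuffled control can be sketched as follows. This is a rough illustration only: the hit-counting function here is a simple repeated-substring count, not RS-ESM or CW-ESM, and the base-10 logarithm, the pseudocount value of 0.5, and all function names are assumptions not stated in the caption.

```python
import math
import random
from collections import Counter

def repeated_kmers(seq, k):
    """Count distinct length-k substrings that occur more than once in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return sum(1 for c in counts.values() if c > 1)

def plot_value(n, pseudocount=0.5):
    """log10 of a hit count, with log(0) drawn as 0.

    Since log10(1) is also 0, n = 1 is nudged by a pseudocount (value assumed)
    so that it does not coincide with the n = 0 point.
    """
    if n == 0:
        return 0.0
    if n == 1:
        return math.log10(1 + pseudocount)
    return math.log10(n)

def shuffled(seq, seed=0):
    """Return a residue-shuffled copy of seq (same composition, random order)."""
    chars = list(seq)
    random.Random(seed).shuffle(chars)
    return "".join(chars)

# Toy sequence, invented for illustration: a repetitive stretch plus a unique tail.
seq = "ACDEFGHIKL" * 3 + "MNPQRSTVWY"
real_hits = repeated_kmers(seq, 6)
control_hits = repeated_kmers(shuffled(seq), 6)
```

Comparing `plot_value(real_hits)` with `plot_value(control_hits)` across a range of k gives the kind of real-versus-shuffled curve pair the caption describes, since shuffling preserves composition while destroying residue order.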
