Bioinform Biol Insights. 2009 Nov 24:1:101-26.

doi: 10.4137/bbi.s415.

Peptide vocabulary analysis reveals ultra-conservation and homonymity in protein sequences

Derek Gatherer¹

PMID: 20066129
PMCID: PMC2789693
DOI: 10.4137/bbi.s415

Derek Gatherer. Bioinform Biol Insights. 2009.

Affiliation

¹ MRC Virology Unit, Institute of Virology, Church Street, Glasgow G11 5JR UK.

Abstract

A new algorithm is presented for vocabulary analysis (word detection) in texts of human origin. It achieves 60–70% overall accuracy, greater than 80% accuracy for longer words, and approximately 85% sensitivity on *Alice in Wonderland*, a considerable improvement on previous methods. When applied to protein sequences, it detects short sequences analogous to words in human texts, that is, sequences intolerant of changes in spelling (mutation) and relatively context-independent in their meaning (function). Some of these are homonyms of up to 7 amino acids, which can assume different structures in different proteins. Others are ultra-conserved stretches of up to 18 amino acids within proteins of less than 40% overall identity, reflecting extreme constraint or convergent evolution. Different species are found to have qualitatively different major peptide vocabularies: some are dominated by large gene families, while others are rich in simple repeats or dominated by internally repetitive proteins. This suggests the possibility of a peptide vocabulary signature, analogous to genome signatures in DNA. Homonyms may be useful in detecting convergent evolution and positive selection in protein evolution. Ultra-conserved words may be useful in identifying structures intolerant to substitution over long periods of evolutionary time.
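To make the idea of a peptide "word" shared between proteins concrete, here is a minimal Python sketch that lists exact k-residue strings occurring in two or more sequences. It is not the algorithm presented in the paper, which this record does not describe in detail; it is a plain exact-match k-mer count. The function name `shared_words` and the three toy sequences are invented for illustration. Only the embedded word SLGDRVT is taken from the figure captions.

```python
from collections import defaultdict


def shared_words(sequences, k):
    """Map each k-residue word to the set of sequence names containing it,
    keeping only words found in at least two sequences."""
    seen = defaultdict(set)
    for name, seq in sequences.items():
        # Slide a window of width k along the sequence.
        for i in range(len(seq) - k + 1):
            seen[seq[i:i + k]].add(name)
    return {word: names for word, names in seen.items() if len(names) >= 2}


# Invented toy sequences (hypothetical, not real proteins); the flanking
# residues around SLGDRVT are made up.
toy = {
    "toy_A": "MKQASLGDRVTIHC",
    "toy_B": "GPWSLGDRVTAAEK",
    "toy_C": "MNNPLLYWEQFTRD",
}

result = shared_words(toy, 7)
print(result)
```

An exact-match count like this only finds strings that are identical across sequences. It says nothing about whether the shared string adopts the same structure in each protein, which is the distinction the abstract draws between ultra-conserved words and homonyms.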

Keywords: bioinformatics; gene families; genome signature; motif; peptide conservation; peptide homonymity; peptide vocabulary; protein structure; vocabulary analysis; word detection.

Figures

**Figure 1.**
RS-ESM performed on *Alice in Wonderland*. Number of hits plotted against k. TP: true positives. Sen2 is also plotted (×100) to show its improvement at higher levels of k.

**Figure 2.**
Number of candidate DWoPs (×1000, kword) plotted against length of text (kchar) for RS-ESM, k = 6–18. Alice: *Alice in Wonderland*, Gulliver: *Gulliver’s Travels*, Oliver: *Oliver Twist*, Chaucer: *Canterbury Tales* in 19th century translation, Origin: *Origin of Species*, Don Quixote: 19th century English translation of same, KJB: *King James Bible*.

**Figure 3.**
NRL3D_63 (solid lines) and its shuffled equivalent (dotted lines), tested with both RS-ESM (squares) and CW-ESM (triangles). The logarithm of the number of hits, n, is plotted against k. Log(0) is arbitrarily designated zero. Pseudocounts are therefore added when n = 1.

**Figure 4.**
The first 150 residues of the alignment of the four sequences containing the word LGGTCVNVGCVPKK (shaded). The proteins are identified by their PDB designations. 1GRT: human glutathione reductase; 1GER: *E. coli* glutathione reductase; 1BZL: *Trypanosoma cruzi* trypanothione reductase; 1TYP: *Crithidia fasciculata* trypanothione reductase.

**Figure 5.**
Superposition of sequence LGGTCVNVGCVPKK in the 4 proteins aligned in Figure 4. The helical backbone is shown in black. Despite the variability of the other parts of these proteins, LGGTCVNVGCVPKK represents a region of extreme structural, and presumably functional, conservation between *E. coli*, humans and trypanosomes.

**Figure 6.**
The first 150 residues of the alignment of the three sequences containing the word AFLGIPFAEPPVG (shaded). 1EVE: *Torpedo californica* acetylcholinesterase; 1MAAD: mouse acetylcholinesterase chain D; 1LPM: *Candida rugosa* lipase.

**Figure 7.**
Superposition of SLGDRVT in mouse antibody proteins 1JRHL (white) and 1NMCL (black). Backbone traces are rendered as fine lines.

**Figure 8.**
Comparison of SLGDRVT word in *Streptomyces albus* lactamase 1BSG (white) and mouse antibody protein 1NMCL (black). Backbone traces are rendered as fine lines.

**Figure 9.**
Candidate words of k = 6 to 18, detected using RS-ESM, against number of proteins for 35 eubacterial (black circles) and 28 archaeal (grey circles) proteomes.

**Figure 10.**
N-terminal region of 3 *C. muridarum* transporter proteins showing the GPNGAGKSTL word (shaded).

**Figure 11.**
HHRIKNNLQVISSLLDL (shaded) in part of a histidine kinase alignment.

**Figure 12.**
VLVIGA in (from left to right) 1D4O.A, 1BFD and 1AD3.A. The backbone trace is drawn as a black line.

Grants and funding

MC_U130169960/MRC_/Medical Research Council/United Kingdom
