Peptide vocabulary analysis reveals ultra-conservation and homonymity in protein sequences
- PMID: 20066129
- PMCID: PMC2789693
- DOI: 10.4137/bbi.s415
Abstract
A new algorithm is presented for vocabulary analysis (word detection) in texts of human origin. It achieves 60%-70% overall accuracy, greater than 80% accuracy for longer words, and approximately 85% sensitivity on Alice in Wonderland, a considerable improvement on previous methods. When applied to protein sequences, it detects short sequences analogous to words in human texts, i.e. sequences that are intolerant to changes in spelling (mutation) and relatively context-independent in their meaning (function). Some of these are homonyms of up to 7 amino acids, which can assume different structures in different proteins. Others are ultra-conserved stretches of up to 18 amino acids within proteins of less than 40% overall identity, reflecting extreme constraint or convergent evolution. Different species are found to have qualitatively different major peptide vocabularies: some are dominated by large gene families, while others are rich in simple repeats or dominated by internally repetitive proteins. This suggests the possibility of a peptide vocabulary signature, analogous to genome signatures in DNA. Homonyms may be useful in detecting convergent evolution and positive selection in protein evolution. Ultra-conserved words may be useful in identifying structures intolerant to substitution over long periods of evolutionary time.
Keywords: bioinformatics; gene families; genome signature; motif; peptide conservation; peptide homonymity; peptide vocabulary; protein structure; vocabulary analysis; word detection.
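As a rough illustration of what an exactly conserved peptide "word" shared between two proteins looks like, the sketch below finds the longest contiguous stretch of identical residues in two sequences. It is not the word-detection algorithm described in the abstract, which is not specified here; it is a plain longest-common-substring search. The function name `longest_shared_peptide` and the two toy sequences are assumptions made up for the example.

```python
def longest_shared_peptide(seq_a: str, seq_b: str) -> str:
    """Return the longest contiguous peptide present in both sequences.

    Illustrative only: a standard dynamic-programming longest-common-substring
    search, not the vocabulary-analysis method from the paper.
    """
    best_len, best_end = 0, 0
    # prev[j] holds the length of the shared run ending at seq_a[i-2], seq_b[j-1]
    prev = [0] * (len(seq_b) + 1)
    for i in range(1, len(seq_a) + 1):
        curr = [0] * (len(seq_b) + 1)
        for j in range(1, len(seq_b) + 1):
            if seq_a[i - 1] == seq_b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return seq_a[best_end - best_len:best_end]


if __name__ == "__main__":
    # Toy sequences invented for this example; they are not real proteins.
    print(longest_shared_peptide("MKTAYIAKQR", "GGTAYIAKWW"))  # TAYIAK
```

A search of this kind only reports exact identity between one pair of sequences; deciding whether a shared stretch counts as a word in the paper's sense would need the statistical treatment the abstract alludes to.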