Modeling the percolation of annotation errors in a database of protein sequences
- PMID: 12490449
- DOI: 10.1093/bioinformatics/18.12.1641
Abstract
Public sequence databases contain information on the sequence, structure and function of proteins. Genome sequencing projects have led to a rapid increase in protein sequence information, but reliable, experimentally verified information on protein function lags a long way behind. To address this deficit, functional annotation in protein databases is often inferred by sequence similarity to homologous, annotated proteins, with the attendant possibility of error. The functional annotation of these homologous proteins may itself have been acquired through sequence similarity to yet other proteins, and it is generally not possible to determine how the functional annotation of any given protein was acquired. Thus the possibility of chains of misannotation arises, a process we term 'error percolation'. With some simple assumptions, we develop a dynamical probabilistic model for these misannotation chains. Exploring the consequences of the model for annotation quality shows that this iterative approach leads to a systematic deterioration of database quality.
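The abstract does not give the model's equations, so the following is only an illustrative sketch of how chained annotation transfer can accumulate errors. It is not the authors' model. It assumes a mean-field picture in which the database grows by a fixed fraction per step, each new entry copies its annotation from a uniformly chosen existing entry, and each transfer independently introduces an error with a fixed probability. The function name and the parameters `growth`, `transfer_error`, `verified_fraction`, and `initial_error` are all hypothetical.

```python
def simulate_error_fraction(steps, growth=0.5, transfer_error=0.05,
                            verified_fraction=0.0, initial_error=0.0):
    """Toy mean-field recurrence for the erroneous fraction of a growing database.

    All parameters are illustrative assumptions, not values from the paper:
      growth            - fractional increase in database size per step
      transfer_error    - probability that a similarity-based transfer is wrong
      verified_fraction - share of new entries annotated experimentally
                          (assumed error-free here)
      initial_error     - erroneous fraction of the starting database
    Returns the list of error fractions e_0, e_1, ..., e_steps.
    """
    size = 1.0                       # database size in arbitrary units
    erroneous = initial_error * size
    history = [initial_error]
    for _ in range(steps):
        current = erroneous / size   # chance a randomly chosen source is wrong
        added = growth * size
        inferred = (1.0 - verified_fraction) * added
        # An inferred annotation is right only if the source is right
        # AND the transfer itself is right.
        p_wrong = 1.0 - (1.0 - current) * (1.0 - transfer_error)
        erroneous += inferred * p_wrong
        size += added
        history.append(erroneous / size)
    return history


if __name__ == "__main__":
    for step, frac in enumerate(simulate_error_fraction(10)):
        print(f"step {step:2d}: erroneous fraction = {frac:.4f}")
```

Under these assumptions the erroneous fraction never decreases, because every inferred entry is at least as likely to be wrong as its source. Setting `verified_fraction` above zero lets the fraction level off below 1 in this toy recurrence, which is one way to explore how much experimental verification would be needed to offset percolation.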