Bioinformatics. 2012 Sep 15;28(18):i562-i568.

doi: 10.1093/bioinformatics/bts372.

An approach to describing and analysing bulk biological annotation quality: a case study using UniProtKB

Michael J Bell¹, Colin S Gillespie, Daniel Swan, Phillip Lord

PMID: 22962482
PMCID: PMC3436799
DOI: 10.1093/bioinformatics/bts372

Michael J Bell et al. Bioinformatics. 2012.

Affiliation

¹ School of Computing Science, Newcastle University, Newcastle-Upon-Tyne, NE1 7RU, UK.

Abstract

Motivation: Annotations are a key feature of many biological databases, used to convey our knowledge of a sequence to the reader. Ideally, annotations are curated manually; however, manual curation is costly and time-consuming, and it requires expert knowledge and training. Given these issues and the exponential increase in data, many databases implement automated annotation pipelines in an attempt to avoid unannotated entries. Both manual and automated annotations vary in quality between databases and annotators, making it difficult for users to assess annotation reliability. The community lacks a generic measure for determining annotation quality and correctness, a gap that this article aims to address. Specifically, we investigate word reuse within bulk textual annotations and relate this to Zipf's Principle of Least Effort. We use the UniProt Knowledgebase (UniProtKB) as a case study to demonstrate this approach, since it allows us to compare annotation change both over time and between automated and manually curated annotations.

Results: By applying power-law distributions to word reuse in annotation, we show clear trends in UniProtKB over time, which are consistent with existing studies of quality in free-text English. Further, we show a clear distinction between manual and automated analysis and investigate cohorts of protein records as they mature. These results suggest that this approach holds distinct promise as a mechanism for judging annotation quality.

Availability: Source code is available at the authors' website: http://homepages.cs.ncl.ac.uk/m.j.bell1/annotation.

Contact: phillip.lord@newcastle.ac.uk.

Figures

**Fig. 1.**
Outline view of the data extraction process. (1) We first download the complete dataset for a given database version in flat-file format. (2) We then extract the comment lines (lines beginning with ‘CC’, the comment indicator). (3) We remove comment blocks and properties [as defined in the UniProtKB manual (UniProt Consortium, 2011)], punctuation, the ‘CC’ indicator and brackets, and convert words to lower case so that they are treated case-insensitively. (4) Finally, we count the individual words and update the total occurrence count for each word
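Steps (2) to (4) of this extraction process can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `count_comment_words` and both regular expressions are assumptions, and the topic-marker pattern only approximates the "comment blocks and properties" that the UniProtKB manual defines. Hyphenated terms are split into separate words here, which may differ from the original pipeline.

```python
import re
from collections import Counter

# Assumed pattern for a topic marker such as "-!- FUNCTION:". The exact set of
# comment blocks and properties removed by the authors is not reproduced here.
TOPIC_MARKER = re.compile(r"-!-\s+[A-Z][A-Z ]*:")
# Anything that is not a lower-case letter, digit or whitespace is treated as
# punctuation (this includes brackets and hyphens).
NON_WORD = re.compile(r"[^a-z0-9\s]")


def count_comment_words(lines):
    """Count lower-cased words on lines that start with the 'CC' indicator."""
    counts = Counter()
    for line in lines:
        if not line.startswith("CC"):
            continue  # step 2: keep comment lines only
        text = line[2:]  # step 3: drop the 'CC' indicator itself
        text = TOPIC_MARKER.sub(" ", text)  # drop the topic marker, if any
        text = NON_WORD.sub(" ", text.lower())  # lower-case, strip punctuation
        counts.update(text.split())  # step 4: update per-word totals
    return counts


example = [
    "ID   TEST_HUMAN",
    "CC   -!- FUNCTION: Binds DNA (by similarity).",
    "CC       Binds RNA.",
    "DE   Binds nothing",
]
counts = count_comment_words(example)
```

For the four example lines, only the two `CC` lines contribute, so `counts` holds "binds" twice and "dna", "by", "similarity" and "rna" once each.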

**Fig. 2.**
Cumulative distributions of words for various Swiss-Prot and TrEMBL versions, shown with logarithmic scales. The size (number of words) is shown along the X-axis, whereas the probability is shown on the Y-axis. A point on the graph represents the probability that a word will occur x or more times. For example, the upper-leftmost point represents the probability of 1 (i.e. 10⁰) that a given word will occur once (i.e. 10⁰) or more times. A word must occur at least once to be included. Words occurring very frequently are presented in the bottom right of the graph. (a) Shows the resulting graphs for Swiss-Prot version 9 (November 1988) and Swiss-Prot version 37 (December 1998), with and without copyright. The distinct structure visible between x = 10⁴ and x = 10⁵ in Swiss-Prot version 37 (bottom left panel) is caused by the copyright statement declaration. Swiss-Prot version 9 operates as a control to show that the attempted removal of copyright has no effect where no copyright information is present. (b) Shows the data with fitted power-law distributions for an even subset of historical versions of Swiss-Prot and the co-ordinate release of TrEMBL
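The quantity plotted in these graphs, the probability that a word occurs x or more times, can be computed directly from a table of word counts. The sketch below assumes a plain mapping from word to occurrence count; the function name `empirical_ccdf` is hypothetical, and the plotting itself (log-log axes) is left out.

```python
from collections import Counter


def empirical_ccdf(word_counts):
    """Return sorted (x, P(X >= x)) pairs for the distinct occurrence counts."""
    # sizes[x] is the number of distinct words that occur exactly x times.
    sizes = Counter(word_counts.values())
    total = sum(sizes.values())
    points = []
    remaining = total  # number of words occurring x or more times
    for x in sorted(sizes):
        points.append((x, remaining / total))
        remaining -= sizes[x]
    return points


ccdf = empirical_ccdf({"the": 4, "protein": 2, "binds": 2, "kinase": 1})
```

With these four toy words, `ccdf` is `[(1, 1.0), (2, 0.75), (4, 0.25)]`: every word occurs at least once, three of the four occur at least twice, and one occurs four times. The first pair corresponds to the upper-leftmost point described in the caption.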

**Fig. 3.**
α values over time, for each version of Swiss-Prot and TrEMBL. The graph shows the difference in α value (with 95% credible region) from UniProtKB/Swiss-Prot version 16, for which the α value was 1.62. For example, Swiss-Prot version 9 has a difference of approximately 0.45, so the resulting α for Swiss-Prot version 9 is around 2.07
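The α values in this figure are exponents of power laws fitted to the word-count data, and the credible regions indicate that the authors' fit is Bayesian. As a rough point-estimate stand-in, and explicitly not the authors' procedure, the sketch below uses the widely cited approximate maximum-likelihood estimator for discrete power-law data, α ≈ 1 + n / Σ ln(xᵢ / (x_min − 0.5)). The function name `estimate_alpha` and the default `x_min = 1` are assumptions, and the approximation is known to be coarse when `x_min` is small.

```python
import math


def estimate_alpha(counts, x_min=1):
    """Approximate maximum-likelihood exponent for discrete power-law data."""
    tail = [x for x in counts if x >= x_min]
    if not tail:
        raise ValueError("no observations at or above x_min")
    # Continuous approximation to the discrete estimator; the x_min - 0.5
    # shift is part of that approximation, not a tuning parameter.
    log_sum = sum(math.log(x / (x_min - 0.5)) for x in tail)
    return 1.0 + len(tail) / log_sum


# Toy input: four words that each occur exactly once.
alpha = estimate_alpha([1, 1, 1, 1])
```

For the toy input every term in the sum equals ln 2, so `alpha` reduces to 1 + 1/ln 2. Real input would be the per-word occurrence counts from one database version, and a full analysis would also need a principled choice of `x_min` and an uncertainty estimate.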

**Fig. 4.**
Swiss-Prot (red circles) and TrEMBL (blue triangles). (a) Growth (number of entries) in Swiss-Prot and TrEMBL over time. (b) Average creation date over time for Swiss-Prot and TrEMBL. (c) Difference between release date and average creation date (i.e. age) over time

**Fig. 5.**
(a) Analysis of those entries that are new to a particular version of Swiss-Prot. (b) α value (with 95% credible region) for all entries in Swiss-Prot version 9 that are in UniProtKB version 15, all entries in UniProtKB version 15 that are in Swiss-Prot version 9, and all those in UniProtKB version 15, but not in Swiss-Prot version 9
