Biases in the experimental annotations of protein function and their effect on our understanding of protein function space

Alexandra M Schnoes¹, David C Ream, Alexander W Thorman, Patricia C Babbitt, Iddo Friedberg

PMID: 23737737
PMCID: PMC3667760
DOI: 10.1371/journal.pcbi.1003063

Alexandra M Schnoes et al. PLoS Comput Biol. 2013;9(5):e1003063. doi: 10.1371/journal.pcbi.1003063. Epub 2013 May 30.

Affiliation

¹ Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, California, USA.

Abstract

The ongoing functional annotation of proteins relies upon the work of curators to capture experimental findings from the scientific literature and apply them to protein sequence and structure data. However, with the increasing use of high-throughput experimental assays, a small number of experimental studies dominate the functional protein annotations collected in databases. Here, we investigate just how prevalent the "few articles - many proteins" phenomenon is. We examine the experimentally validated annotation of proteins provided by several groups in the GO Consortium, and show that the distribution of proteins per published study is exponential, with 0.14% of articles providing the source of annotations for 25% of the proteins in the UniProt-GOA compilation. Since each of the dominant articles describes the use of an assay that can find only one function or a small group of functions, this leads to substantial biases in what we know about the function of many proteins. Mass spectrometry, microscopy and RNAi experiments dominate high-throughput experiments. Consequently, the functional information derived from these experiments is mostly of the subcellular location of proteins, and of the participation of proteins in embryonic developmental pathways. For some organisms, the information provided by different studies overlaps substantially. We also show that the information provided by high-throughput experiments is less specific than that provided by low-throughput experiments. Given the experimental techniques available, certain biases in protein function annotation due to high-throughput experiments are unavoidable. Knowing that these biases exist and understanding their characteristics and extent is important for database curators, developers of function annotation programs, and anyone who uses protein function annotation data to plan experiments.
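
As an illustration of the "few articles - many proteins" measurement, the following minimal Python sketch computes what share of all annotated proteins is covered by the most prolific fraction of articles. The function name, the input layout (pairs of article and protein identifiers) and the toy data are assumptions made for this example; they are not taken from the study's own pipeline.

```python
from collections import defaultdict


def coverage_of_top_articles(annotations, top_fraction):
    """Fraction of all annotated proteins covered by the most prolific
    `top_fraction` of articles.

    `annotations` is an iterable of (article_id, protein_id) pairs
    (a hypothetical layout chosen for this sketch).
    """
    proteins_by_article = defaultdict(set)
    for article_id, protein_id in annotations:
        proteins_by_article[article_id].add(protein_id)
    if not proteins_by_article:
        raise ValueError("no annotations given")

    # Rank articles by the number of distinct proteins each one annotates.
    ranked = sorted(proteins_by_article.values(), key=len, reverse=True)
    n_top = max(1, round(len(ranked) * top_fraction))

    all_proteins = set().union(*ranked)
    covered = set().union(*ranked[:n_top])
    return len(covered) / len(all_proteins)


# Toy data (made up): one large article and three small ones.
toy = [("A1", f"P{i}") for i in range(6)] + [
    ("A2", "P6"), ("A3", "P7"), ("A4", "P0"),
]
print(coverage_of_top_articles(toy, 0.25))  # prints 0.75
```

In the toy data the top quarter of articles (one article out of four) covers six of the eight distinct proteins. Applied to real data, the same calculation would take one pair per experimentally supported annotation.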

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Figure 1. Distribution of the number of proteins annotated per article.**
X-axis: number of annotating articles. Y-axis: number of annotated proteins. The distribution was found to be logarithmic with a significant () linear fit to the log-log plot. The data came from 76137 articles annotating 256033 proteins with GO experimental evidence codes, in UniProt-GOA 12/2011.
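
A linear fit to a log-log plot can be sketched as follows. This is a minimal example using ordinary least squares on log10-transformed values, assuming the input is a list giving the number of proteins annotated by each article; the function name and the toy counts are hypothetical, and the fitting procedure actually used for the figure is not stated here. A straight line on log-log axes corresponds to a relation of the form count ≈ c · k^slope.

```python
import math
from collections import Counter


def loglog_fit(proteins_per_article):
    """Least-squares fit of log10(number of articles) against log10(k),
    where k is the number of proteins an article annotates.

    Needs at least two distinct positive values of k.
    Returns (slope, intercept, r_squared).
    """
    histogram = Counter(proteins_per_article)  # k -> number of articles
    xs = [math.log10(k) for k in histogram]
    ys = [math.log10(count) for count in histogram.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy) if syy else 1.0
    return slope, intercept, r_squared


# Toy data (made up): 100 articles with 1 protein, 10 with 10, 1 with 100.
toy_counts = [1] * 100 + [10] * 10 + [100]
slope, intercept, r_squared = loglog_fit(toy_counts)
```

For the toy counts the points lie exactly on a line, so the fit gives a slope of -1, an intercept of 2 and an R² of 1.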

**Figure 2. Relative contribution of top-50 articles to the annotation of major model organisms.**
The length of each bar represents the percentage of proteins annotated by the top-50 articles in a given organism by a given GO term. GO terms that are present in more than one species are highlighted.

**Figure 3. Redundancy in proteins described by the top-50 articles.**
Each circle represents all the articles annotating one organism. Each colored arc is composed of all the proteins in a single article. A line is drawn between any two points on the circle if the proteins they represent have 100% sequence identity. A black line is drawn if they are annotated in different ontologies (for example, in one article the protein is annotated with the MFO, and in another article with the BPO); a red line if they are annotated in the same ontology. Example: *S. pombe* is described by two articles, one with few proteins (light arc on bottom) and one with many (dark arc encompassing most of the circle). Many of the same proteins are annotated by both articles. See Table 2 for numbers.

**Figure 4. Information provided by articles depending on the number of proteins the articles annotate.**
Articles are grouped into four cohorts by the number of proteins they annotate: exactly 1 protein; more than 1 and up to 10; more than 10 and fewer than 100; and 100 or more. Blue bars: Molecular Function ontology; Green bars: Biological Process ontology; Red bars: Cellular Component ontology. Information is gauged by A: Information Content and B: GO depth. See text for details.
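
The cohort grouping and an information-content score can be sketched as below. The cohort labels and function names are made up for this example. The information content here is assumed to be the common definition, the negative base-2 logarithm of a term's relative frequency among annotations; the exact definition used for the figure is given in the article text, not in this caption.

```python
import math
from collections import Counter


def cohort(n_proteins):
    """Assign an article to a cohort by the number of proteins it annotates.

    Labels are hypothetical; boundaries follow the caption.
    """
    if n_proteins < 1:
        raise ValueError("an article must annotate at least one protein")
    if n_proteins == 1:
        return "1"
    if n_proteins <= 10:
        return "2-10"
    if n_proteins < 100:
        return "11-99"
    return "100+"


def information_content(term_annotations):
    """Assumed definition: IC(term) = -log2(relative frequency of term).

    `term_annotations` lists one term per annotation, so frequent terms
    receive a low score and rare terms a high one.
    """
    counts = Counter(term_annotations)
    total = sum(counts.values())
    return {term: -math.log2(c / total) for term, c in counts.items()}


# Toy annotations (made-up term names, not real GO identifiers).
toy_terms = ["term_a"] * 4 + ["term_b"] * 2 + ["term_c"] * 2
ic = information_content(toy_terms)
print(ic["term_a"], ic["term_b"])  # prints 1.0 2.0
```

Averaging these scores over the terms contributed by the articles in each cohort would give one bar per cohort and ontology.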
