Nat Methods. 2013 Mar;10(3):221-7.
doi: 10.1038/nmeth.2340. Epub 2013 Jan 27.

A large-scale evaluation of computational protein function prediction

Predrag Radivojac, Wyatt T Clark, Tal Ronnen Oron, Alexandra M Schnoes, Tobias Wittkop, Artem Sokolov, Kiley Graim, Christopher Funk, Karin Verspoor, Asa Ben-Hur, Gaurav Pandey, Jeffrey M Yunes, Ameet S Talwalkar, Susanna Repo, Michael L Souza, Damiano Piovesan, Rita Casadio, Zheng Wang, Jianlin Cheng, Hai Fang, Julian Gough, Patrik Koskinen, Petri Törönen, Jussi Nokso-Koivisto, Liisa Holm, Domenico Cozzetto, Daniel W A Buchan, Kevin Bryson, David T Jones, Bhakti Limaye, Harshal Inamdar, Avik Datta, Sunitha K Manjari, Rajendra Joshi, Meghana Chitale, Daisuke Kihara, Andreas M Lisewski, Serkan Erdin, Eric Venner, Olivier Lichtarge, Robert Rentzsch, Haixuan Yang, Alfonso E Romero, Prajwal Bhat, Alberto Paccanaro, Tobias Hamp, Rebecca Kaßner, Stefan Seemayer, Esmeralda Vicedo, Christian Schaefer, Dominik Achten, Florian Auer, Ariane Boehm, Tatjana Braun, Maximilian Hecht, Mark Heron, Peter Hönigschmid, Thomas A Hopf, Stefanie Kaufmann, Michael Kiening, Denis Krompass, Cedric Landerer, Yannick Mahlich, Manfred Roos, Jari Björne, Tapio Salakoski, Andrew Wong, Hagit Shatkay, Fanny Gatzmann, Ingolf Sommer, Mark N Wass, Michael J E Sternberg, Nives Škunca, Fran Supek, Matko Bošnjak, Panče Panov, Sašo Džeroski, Tomislav Šmuc, Yiannis A I Kourmpetis, Aalt D J van Dijk, Cajo J F ter Braak, Yuanpeng Zhou, Qingtian Gong, Xinran Dong, Weidong Tian, Marco Falda, Paolo Fontana, Enrico Lavezzo, Barbara Di Camillo, Stefano Toppo, Liang Lan, Nemanja Djuric, Yuhong Guo, Slobodan Vucetic, Amos Bairoch, Michal Linial, Patricia C Babbitt, Steven E Brenner, Christine Orengo, Burkhard Rost, Sean D Mooney, Iddo Friedberg

Predrag Radivojac et al. Nat Methods. 2013 Mar.

Abstract

Automated annotation of protein function is challenging. As the number of sequenced genomes rapidly grows, the overwhelming majority of protein products can only be annotated computationally. If computational predictions are to be relied upon, it is crucial that the accuracy of these methods be high. Here we report the results from the first large-scale community-based critical assessment of protein function annotation (CAFA) experiment. Fifty-four methods representing the state of the art for protein function prediction were evaluated on a target set of 866 proteins from 11 organisms. Two findings stand out: (i) today's best protein function prediction algorithms substantially outperform widely used first-generation methods, with large gains on all types of targets; and (ii) although the top methods perform well enough to guide experiments, there is considerable need for improvement of currently available tools.

Conflict of interest statement

The authors declare no competing financial interests.

Figures

Figure 1. Experiment timeline and target analysis.
(a) Timeline for the CAFA experiment. (b) Number of target sequences per organism. The graph shows the number of target sequences for each of the ontologies (Molecular Function and Biological Process) as well as the total number of targets, obtained as a union between sequences in the two ontologies. Of 866 proteins, 531 had Molecular Function annotations and 587 had Biological Process annotations. (c) Distribution of target sequences in each ontology according to the number of leaf terms available for each protein sequence. For example, in the Molecular Function category, 79% of proteins had one leaf term, 16% had two leaf terms, and so on. A term is considered a leaf term for a particular target if no other GO term associated with that sequence is its descendant.
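The leaf-term definition in panel (c) can be illustrated with a minimal Python sketch. The term identifiers (`T0` to `T4`) and the `parents` mapping below form a hypothetical toy hierarchy, not the actual Gene Ontology graph, and the function names are illustrative only.

```python
def ancestors(term, parents):
    """Return all strict ancestors of `term`, given a child -> parents mapping."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def leaf_terms(annotated, parents):
    """Terms of one target that have no descendant among that target's other terms."""
    annotated = set(annotated)
    non_leaves = set()
    for term in annotated:
        # Any annotated term that is an ancestor of another annotated term
        # has a descendant in the set, so it cannot be a leaf.
        non_leaves |= ancestors(term, parents) & annotated
    return annotated - non_leaves


# Hypothetical toy hierarchy: T0 is the root; T1 and T3 are its children;
# T2 is a child of T1; T4 is a child of T3.
parents = {"T1": {"T0"}, "T2": {"T1"}, "T3": {"T0"}, "T4": {"T3"}}

# A target annotated with T0, T1, T2 and T3 has two leaf terms: T2 and T3.
example_leaves = leaf_terms({"T0", "T1", "T2", "T3"}, parents)
```

Note that `T3` counts as a leaf for this target even though it has a child (`T4`) in the hierarchy, because `T4` is not among the target's own annotations.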
Figure 2. Overall performance evaluation.
(a,b) The maximum F-measure for the top-performing methods for Molecular Function ontology (a) and Biological Process ontology (b). All panels show the top ten participating methods in each category as well as the BLAST and Naive baseline methods. Note that 33 models outperformed BLAST in the Molecular Function category, whereas 26 models outperformed BLAST in the Biological Process category (cutoff scores below which methods were excluded from the panels were 0.468 and 0.300 for the Molecular Function and Biological Process categories, respectively). In the Molecular Function category, proteins with “protein binding” as their only leaf term were excluded from the analysis because the protein binding term was not considered informative (results that include those proteins are presented in Supplementary Fig. 3). A perfect predictor would be characterized with Fmax = 1. Confidence intervals (95%) were determined using bootstrapping with n = 10,000 iterations on the set of target sequences. For cases in which a principal investigator participated in multiple teams, only the results of the best-scoring method are presented.
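As a rough illustration of the maximum F-measure, the sketch below sweeps a score threshold and keeps the best F value. The caption does not spell out how precision and recall are averaged; this sketch assumes a protein-centric convention in which precision is averaged over targets with at least one prediction at the threshold and recall is averaged over all targets. The function name, the data layout, and the 0.01 threshold grid are assumptions, as is the requirement that every target has at least one true term.

```python
def f_max(predictions, truth, thresholds=None):
    """Maximum F-measure over score thresholds.

    predictions: target -> {term: score in [0, 1]}
    truth:       target -> non-empty set of experimentally annotated terms
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]  # assumed grid
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for target, true_terms in truth.items():
            scores = predictions.get(target, {})
            predicted = {term for term, s in scores.items() if s >= t}
            hits = len(predicted & true_terms)
            if predicted:
                # Precision is only defined where something was predicted.
                precisions.append(hits / len(predicted))
            recalls.append(hits / len(true_terms))
        if not precisions:
            continue  # no target has a prediction at this threshold
        p = sum(precisions) / len(precisions)
        r = sum(recalls) / len(recalls)
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return best
```

Under these assumptions, a method that assigns the correct terms a higher score than every incorrect term reaches a value of 1 at some threshold, which matches the caption's description of a perfect predictor.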
Figure 3. Domain analysis and performance evaluation for single-domain versus multidomain eukaryotic targets.
(a) Distribution of target proteins with respect to the number of Pfam domains they contain. (b) Performance evaluation in the Molecular Function category. Each of the ten top-performing methods showed higher accuracy (higher Fmax) on single-domain proteins. Confidence intervals (95%) were determined using bootstrapping with n = 10,000 iterations on the set of target sequences.
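The bootstrapped confidence intervals can be sketched as follows: resample the target set with replacement, recompute the statistic on each resample, and read off the interval from the sorted values. The captions state only the number of iterations and that resampling was done over target sequences; the percentile method, the function name, and the `statistic` callable used here are assumptions.

```python
import random


def bootstrap_ci(targets, statistic, n_iterations=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `statistic` over resampled targets."""
    rng = random.Random(seed)
    targets = list(targets)
    values = []
    for _ in range(n_iterations):
        # Draw as many targets as in the original set, with replacement.
        sample = rng.choices(targets, k=len(targets))
        values.append(statistic(sample))
    values.sort()
    lo_idx = int((alpha / 2) * n_iterations)
    hi_idx = min(n_iterations - 1, int((1 - alpha / 2) * n_iterations))
    return values[lo_idx], values[hi_idx]


def mean(xs):
    """Arithmetic mean; stands in for a per-method score such as Fmax."""
    return sum(xs) / len(xs)
```

In practice `statistic` would recompute the evaluation metric on the resampled targets; the mean is used here only to keep the sketch self-contained.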
Figure 4. Case study on the human PNPT1 gene.
(a) Domain architecture of human PNPT1 gene according to the Pfam classification. For each domain, the numbers of different leaf terms (for the Molecular Function and Biological Process categories) associated with any protein in Swiss-Prot database containing this domain are shown. (b) Molecular Function terms (six of which are leaves) associated with the human PNPT1 gene in Swiss-Prot as of December 2011. Colored circles represent the predicted terms for three representative methods as well as two baseline methods. The prediction threshold for each method was selected to correspond to the point in the precision-recall space that provides the maximum F-measure. J (blue), Jones-UCL; O (magenta), Team Orengo; d (navy blue), dcGO; B (green), BLAST; N (brown), Naive. Dashed lines indicate the presence of other terms between the source and destination nodes.
