Gigascience. 2015 Sep 14;4:43.

doi: 10.1186/s13742-015-0083-4. eCollection 2015.

The PFP and ESG protein function prediction methods in 2014: effect of database updates and ensemble approaches

Ishita K Khan¹, Qing Wei¹, Samuel Chapman², Dukka B Kc², Daisuke Kihara³

Affiliations

¹ Department of Computer Sciences, Purdue University, West Lafayette, IN 47907 USA.
² Department of Computational Science and Engineering, North Carolina A&T State University, Greensboro, NC 27411 USA.
³ Department of Computer Sciences, Purdue University, West Lafayette, IN 47907 USA; Department of Biological Sciences, Purdue University, West Lafayette, IN 47907 USA.

PMID: 26380077
PMCID: PMC4570625
DOI: 10.1186/s13742-015-0083-4

Ishita K Khan et al. Gigascience. 2015.

Abstract

Background: Functional annotation of novel proteins is one of the central problems in bioinformatics. With the continuing development of genome sequencing technologies, more and more sequence information is becoming available to analyze and annotate. To achieve fast and automatic function annotation, many computational (automated) function prediction (AFP) methods have been developed. To objectively evaluate the performance of such methods on a large scale, community-wide assessment experiments have been conducted. The second round of the Critical Assessment of Function Annotation (CAFA) experiment was held in 2013-2014. Evaluation of participating groups was reported in a special interest group meeting at the Intelligent Systems for Molecular Biology (ISMB) conference in Boston in 2014. Our group participated in both CAFA1 and CAFA2 using multiple in-house AFP methods. Here, we report benchmark results of our methods obtained in the course of preparation for CAFA2, prior to submitting function predictions for CAFA2 targets.

Results: For CAFA2, we updated the annotation databases used by our methods, protein function prediction (PFP) and extended similarity group (ESG), and benchmarked their function prediction performance using the original (older) and updated databases. Performance evaluations for PFP with different settings and for ESG are discussed. We also developed two ensemble methods that combine function predictions from six independent, sequence-based AFP methods. We further analyzed the performance of our prediction methods when the predictions were enriched with the prior distribution of gene ontology (GO) terms. Examples of predictions by the ensemble methods are discussed.

Conclusions: Updating the annotation database was successful, improving the Fmax prediction accuracy score for both PFP and ESG. Adding the prior distribution of GO terms yielded little improvement. Both of the ensemble methods we developed improved the average Fmax score over all individual component methods except for ESG. Our benchmark results will not only complement the overall assessment that will be done by the CAFA organizers, but also help elucidate the predictive power of sequence-based function prediction methods in general.

Keywords: CAFA; ESG; PFP; Protein function; consensus method; ensemble method; function prediction; gene annotation; sequence.
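
The Fmax score named in the abstract is the maximum F-measure (harmonic mean of precision and recall) obtained over all score cutoffs. The following is a minimal per-protein sketch, not the authors' evaluation code. It assumes predictions arrive as a mapping from GO term to confidence score and that each distinct predicted score is tried as a cutoff; the function name `fmax` and the term labels `GO:A`, `GO:B`, `GO:C` are hypothetical placeholders.

```python
def fmax(predicted, true_terms):
    """Maximum F-measure over score cutoffs for a single protein.

    predicted:  dict mapping GO term -> confidence score (assumed input format)
    true_terms: set of experimentally annotated GO terms
    """
    if not true_terms:
        return 0.0  # recall is undefined without true annotations
    best = 0.0
    # Every distinct predicted score is used as a cutoff (an assumption;
    # a fixed grid such as 0.00, 0.01, ..., 1.00 is another common choice).
    for cutoff in sorted(set(predicted.values())):
        selected = {term for term, score in predicted.items() if score >= cutoff}
        true_positives = len(selected & true_terms)
        precision = true_positives / len(selected)
        recall = true_positives / len(true_terms)
        if precision + recall > 0:
            f_measure = 2 * precision * recall / (precision + recall)
            best = max(best, f_measure)
    return best


# Hypothetical toy input: three scored predictions, two true terms.
example_prediction = {"GO:A": 0.9, "GO:B": 0.6, "GO:C": 0.2}
example_truth = {"GO:A", "GO:C"}
print(f"{fmax(example_prediction, example_truth):.3f}")
```

An average Fmax over a benchmark set, as reported in the abstract, would then be the mean of this value across query proteins.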


Figures

**Fig. 1**
Performance of protein function prediction (PFP) evaluated on GO terms including parental terms. Performance of PFP using the new and the old PFP database (PFPDB). Before evaluating predictions, both predicted and true GO terms were propagated to the root of the ontology. (a) Evaluation on biological process (BP) GO terms. (b) Evaluation on molecular function (MF) GO terms

**Fig. 2**
Performance of PFP and extended similarity group (ESG) on GO terms including parental terms. Each predicted and true GO term was propagated to the root of the ontology before evaluation. GO terms in all three ontologies (BP, MF, CC) were used in computing prediction accuracy

**Fig. 3**
Fraction of queries where each method showed the largest F_max score. The fraction on the y-axis was computed as the number of queries in which a method had the largest F_max score over the total number of queries (2,055 protein sequences). Frequent pattern mining (FPM) in this graph denotes FPM_MaxLen because it performed better than its counterpart, FPM_maxscoreLen. The fractions do not sum to 100 % because there were cases where multiple methods tied for the largest F_max score

**Fig. 4**
Performance with prior GO term distribution. For PFP, ESG, CONS, FPM, and ESG-OLD, prior GO term distribution was added as a part of the predictions. The numbers shown in the symbol legend are the average F_max scores of the methods. (a) ROC curve. The x-axis is the false positive rate while the y-axis shows the true positive rate. (b) The same data are shown in a precision-recall curve
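
Two steps described in the captions above can be illustrated with a short sketch: propagating GO terms to the root of the ontology before evaluation (Figs. 1 and 2), and counting how often each method attains the largest F_max score when ties are allowed (Fig. 3). This is an illustration under stated assumptions, not the authors' code. The ontology is assumed to be given as a child-to-parents mapping, and the term labels (`T1` to `T4`), the function names, and the toy scores are all hypothetical.

```python
def propagate_to_root(terms, parents):
    """Return the given terms together with all of their ancestors.

    parents: dict mapping a term to a list of its direct parent terms
             (an assumed representation of the GO graph).
    """
    closed = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term in closed:
            continue  # already visited through another path
        closed.add(term)
        stack.extend(parents.get(term, ()))
    return closed


def best_method_fractions(scores_by_query):
    """Fraction of queries on which each method reaches the top score.

    scores_by_query: list of dicts, one per query, mapping method -> F_max.
    Tied methods are all counted, so the fractions may sum to more than 1.
    """
    counts = {}
    for per_method in scores_by_query:
        top = max(per_method.values())
        for method, score in per_method.items():
            if score == top:
                counts[method] = counts.get(method, 0) + 1
    total = len(scores_by_query)
    return {method: count / total for method, count in counts.items()}


# Hypothetical toy ontology: T3 -> T2 -> T1 (root), T4 -> T1.
toy_parents = {"T3": ["T2"], "T2": ["T1"], "T4": ["T1"], "T1": []}
print(sorted(propagate_to_root({"T3"}, toy_parents)))

# Hypothetical scores: the second query is a tie between both methods.
toy_scores = [{"PFP": 0.5, "ESG": 0.7}, {"PFP": 0.6, "ESG": 0.6}]
print(best_method_fractions(toy_scores))
```

In the toy scores the second query is a tie, so both methods are credited for it and the fractions add up to more than 1, which is the effect the Fig. 3 caption describes.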


Cited by

Advanced Situation with Recombinant Toxins: Diversity, Production and Application Purposes.
Efremenko E, Aslanli A, Lyagin I. Int J Mol Sci. 2023 Feb 27;24(5):4630. doi: 10.3390/ijms24054630. PMID: 36902061. Free PMC article. Review.
INGA 2.0: improving protein function prediction for the dark proteome.
Piovesan D, Tosatto SCE. Nucleic Acids Res. 2019 Jul 2;47(W1):W373-W378. doi: 10.1093/nar/gkz375. PMID: 31073595. Free PMC article.
ContactPFP: Protein function prediction using predicted contact information.
Kagaya Y, Flannery ST, Jain A, Kihara D. Front Bioinform. 2022 Jun;2:896295. doi: 10.3389/fbinf.2022.896295. Epub 2022 Jun 2. PMID: 35875419. Free PMC article.
Proteomic profiling of hydatid fluid from pulmonary cystic echinococcosis.
Dos Santos GB, da Silva ED, Kitano ES, Battistella ME, Monteiro KM, de Lima JC, Ferreira HB, Serrano SMT, Zaha A. Parasit Vectors. 2022 Mar 21;15(1):99. doi: 10.1186/s13071-022-05232-8. PMID: 35313982. Free PMC article.
BUSCA: an integrative web server to predict subcellular localization of proteins.
Savojardo C, Martelli PL, Fariselli P, Profiti G, Casadio R. Nucleic Acids Res. 2018 Jul 2;46(W1):W459-W466. doi: 10.1093/nar/gky320. PMID: 29718411. Free PMC article.


