BMC Bioinformatics. 2008 Jan 25;9:52.

doi: 10.1186/1471-2105-9-52.

The development of PIPA: an integrated and automated pipeline for genome-wide protein function annotation

Chenggang Yu¹, Nela Zavaljevski, Valmik Desai, Seth Johnson, Fred J Stevens, Jaques Reifman

Affiliation

¹ Biotechnology HPC Software Applications Institute, Telemedicine and Advanced Technology Research Center, US Army Medical Research and Materiel Command, Ft. Detrick, MD, USA. cyu@bioanalysis.org

PMID: 18221520
PMCID: PMC2259298
DOI: 10.1186/1471-2105-9-52

Chenggang Yu et al. BMC Bioinformatics. 2008.

Abstract

Background: Automated protein function prediction methods are needed to keep pace with high-throughput sequencing. With the existence of many programs and databases for inferring different protein functions, a pipeline that properly integrates these resources will benefit from the advantages of each method. However, integrated systems usually do not provide mechanisms to generate customized databases to predict particular protein functions. Here, we describe a tool termed PIPA (Pipeline for Protein Annotation) that has these capabilities.

Results: PIPA annotates protein functions by combining the results of multiple programs and databases, such as InterPro and the Conserved Domains Database, into common Gene Ontology (GO) terms. The major algorithms implemented in PIPA are: (1) a profile database generation algorithm, which generates customized profile databases to predict particular protein functions, (2) an automated ontology mapping generation algorithm, which maps various classification schemes into GO, and (3) a consensus algorithm to reconcile annotations from the integrated programs and databases.

PIPA's profile generation algorithm is employed to construct the enzyme profile database CatFam, which predicts catalytic functions described by Enzyme Commission (EC) numbers. Validation tests show that CatFam yields average recall and precision greater than 95.0%. CatFam is integrated with PIPA. We use an association rule mining algorithm to automatically generate mappings between terms of two ontologies from annotated sample proteins. Incorporating the ontologies' hierarchical topology into the algorithm increases the number of generated mappings. In particular, it generates 40.0% additional mappings from the Clusters of Orthologous Groups (COG) to EC numbers and a six-fold increase in mappings from COG to GO terms. The mappings to EC numbers show a very high precision (99.8%) and recall (96.6%), while the mappings to GO terms show moderate precision (80.0%) and low recall (33.0%). Our consensus algorithm for GO annotation is based on the computation and propagation of likelihood scores associated with GO terms. The test results suggest that, for a given recall, the application of the consensus algorithm yields higher precision than when consensus is not used.

Conclusion: The algorithms implemented in PIPA provide automated genome-wide protein function annotation based on reconciled predictions from multiple resources.
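The abstract states that mappings between two ontologies are mined as association rules from annotated sample proteins, and that using the hierarchy yields more mappings. The following is a minimal sketch of that idea, not the authors' implementation. The function names, the support and confidence thresholds, and the toy sample annotations are all assumptions made for illustration. It shows how a COG whose proteins carry different 4-digit EC numbers can still produce a rule at the shared 3-digit level.

```python
from collections import Counter


def ec_ancestors(ec):
    """Return an EC number and its less specific parents.

    Example: "1.1.1.1" -> ["1.1.1.1", "1.1.1.-", "1.1.-.-", "1.-.-.-"].
    """
    parts = ec.split(".")
    return [".".join(parts[:d] + ["-"] * (4 - d)) for d in range(4, 0, -1)]


def mine_mappings(samples, min_support=2, min_confidence=0.9, use_hierarchy=True):
    """Mine source-term -> EC rules from (source_terms, ec_numbers) samples.

    The thresholds are hypothetical defaults, not values from the paper.
    """
    source_counts = Counter()
    pair_counts = Counter()
    for source_terms, ec_numbers in samples:
        targets = set(ec_numbers)
        if use_hierarchy:
            # Each annotation also counts as evidence for its parent classes.
            targets = {a for ec in ec_numbers for a in ec_ancestors(ec)}
        for s in set(source_terms):
            source_counts[s] += 1
            for t in targets:
                pair_counts[(s, t)] += 1

    rules = {}
    for (s, t), n in pair_counts.items():
        confidence = n / source_counts[s]
        if n >= min_support and confidence >= min_confidence:
            rules[(s, t)] = confidence

    # Keep only the most specific rule on each branch: drop a target when a
    # retained rule for the same source points to one of its descendants.
    return {
        (s, t): c
        for (s, t), c in rules.items()
        if not any(
            s2 == s and t2 != t and t in ec_ancestors(t2) for (s2, t2) in rules
        )
    }


# Toy annotations (illustrative only, not data from the paper).
samples = [
    (["COG1064"], ["1.1.1.1"]),
    (["COG1064"], ["1.1.1.1"]),
    (["COG1064"], ["1.1.1.2"]),
    (["COG0001"], ["5.4.3.8"]),
    (["COG0001"], ["5.4.3.8"]),
]

flat = mine_mappings(samples, use_hierarchy=False)
hier = mine_mappings(samples, use_hierarchy=True)
print(sorted(flat))
print(sorted(hier))
```

In this toy data, the flat run yields a rule only for the COG whose proteins all share one 4-digit EC number, while the hierarchical run additionally maps the other COG to the 3-digit class its proteins share.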

Figures

**Figure 1**
**Overview of PIPA's key modules**. PIPA's programs are organized into three modules. The pipeline execution module consists of programs that enable user access to and control of the pipeline's parallel execution of multiple programs. The execution module wraps the core module, containing all integrated methods (programs and databases), the terminology conversion program, and the consensus annotation program. The support module contains the profile database generation program, which creates new profile databases, and the GO-mapping generation program, which creates GO mappings for the terminology conversion program.

**Figure 2**
**CatFam performance evaluation**. The performance of CatFam is measured by precision and recall, which are defined as precision = TP/(TP+FP) and recall = TP/(TP+FN), where TP, FP and FN represent the number of true-positive, false-positive, and false-negative predictions, respectively. A total of 18,949 proteins, not used for profile generation, are used to evaluate two CatFam databases, CatFam-3D and CatFam-4D, which predict 3-digit and 4-digit EC numbers for query proteins, respectively. The results are sorted according to the maximum sequence identity between the query protein and the proteins used for profile generation.
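The precision and recall definitions in the Figure 2 caption can be written out directly. The sketch below is illustrative: the function name and the counts are hypothetical and are not results from the paper, and returning 0.0 for an empty denominator is a convention chosen here.

```python
def precision_recall(tp, fp, fn):
    """Compute precision = TP/(TP+FP) and recall = TP/(TP+FN).

    Returns 0.0 for a measure whose denominator is zero (a convention
    assumed for this sketch).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts, for illustration only.
precision, recall = precision_recall(tp=90, fp=10, fn=30)
print(precision, recall)
```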

**Figure 3**
**Mapping evaluation**. The number of automatically generated mappings is significantly increased for properly-selected cut-off E-values when the hierarchical topology of the ontologies is used. Larger cut-off E-values (small values on the x axis) result in excessive false hits for sample proteins, while smaller cut-off E-values exclude true hits. Both cases reduce the number of mappings that can be generated.

**Figure 4**
**GO consensus evaluation**. PIPA's GO annotations with and without the consensus algorithm are compared for hierarchical precision (HP) and hierarchical recall (HR). For the consensus algorithm, each data point in the figure is computed by using different combinations of the three parameters, E₀, E₁, and SAT. When the consensus algorithm is not used, each data point is obtained by selecting a different cut-off E-value. The figure indicates that the consensus algorithm improves HP for each of the three GO categories. The highlighted points correspond to consensus algorithm results with parameters E₀= 0.01, E₁= 1e-200, and SAT = 0.99.
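The caption does not define hierarchical precision (HP) and hierarchical recall (HR). One common formulation, assumed here and possibly different from the one used in the paper, extends the predicted and true term sets with all of their ancestors and then computes precision and recall on the extended sets. The sketch below uses a made-up four-term hierarchy and includes the root term in the ancestor sets, which is also an assumption.

```python
def with_ancestors(terms, parents):
    """Return the given terms together with all of their ancestors."""
    closed = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(parents.get(term, ()))
    return closed


def hierarchical_pr(predicted, true, parents):
    """Hierarchical precision and recall on ancestor-extended term sets.

    This is an assumed definition, not necessarily the paper's.
    """
    p = with_ancestors(predicted, parents)
    t = with_ancestors(true, parents)
    overlap = len(p & t)
    hp = overlap / len(p) if p else 0.0
    hr = overlap / len(t) if t else 0.0
    return hp, hr


# Hypothetical hierarchy: A is the root, B is a child of A, C and D are children of B.
parents = {"B": ["A"], "C": ["B"], "D": ["B"]}

# Predicting the parent B of the true term D is fully precise but not fully specific.
hp, hr = hierarchical_pr({"B"}, {"D"}, parents)
print(hp, hr)
```

Under this definition, a prediction that is correct but less specific than the true annotation keeps full HP and loses only HR, whereas a prediction on a sibling branch loses both.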

Cited by

Identification and optimization of classifier genes from multi-class earthworm microarray dataset.
Li Y, Wang N, Perkins EJ, Zhang C, Gong P. PLoS One. 2010 Oct 28;5(10):e13715. doi: 10.1371/journal.pone.0013715. PMID: 21060837. Free PMC article.
The automatic annotation of bacterial genomes.
Richardson EJ, Watson M. Brief Bioinform. 2013 Jan;14(1):1-12. doi: 10.1093/bib/bbs007. Epub 2012 Mar 9. PMID: 22408191. Free PMC article.
AGeS: a software system for microbial genome sequence annotation.
Kumar K, Desai V, Cheng L, Khitrov M, Grover D, Satya RV, Yu C, Zavaljevski N, Reifman J. PLoS One. 2011 Mar 7;6(3):e17469. doi: 10.1371/journal.pone.0017469. PMID: 21408217. Free PMC article.
Quantitative frame analysis and the annotation of GC-rich (and other) prokaryotic genomes. An application to Anaeromyxobacter dehalogenans.
Oden S, Brocchieri L. Bioinformatics. 2015 Oct 15;31(20):3254-61. doi: 10.1093/bioinformatics/btv339. Epub 2015 Jun 4. PMID: 26048600. Free PMC article.
Integration of bioinformatics to biodegradation.
Arora PK, Bae H. Biol Proced Online. 2014 Apr 27;16:8. doi: 10.1186/1480-9222-16-8. eCollection 2014. PMID: 24808763. Free PMC article. Review.
