Enhanced automated function prediction using distantly related sequences and contextual association by PFP

Troy Hawkins¹, Stanislav Luban, Daisuke Kihara

PMID: 16672240
PMCID: PMC2242549
DOI: 10.1110/ps.062153506

Troy Hawkins et al. Protein Sci. 2006 Jun.

Protein Sci. 2006 Jun;15(6):1550-6.

doi: 10.1110/ps.062153506. Epub 2006 May 2.

Affiliation

¹ Department of Biological Sciences, College of Sciences, Purdue University, West Lafayette, Indiana 47907, USA.

Abstract

The impetus for the recent development and emergence of automated function prediction methods is an exponentially growing flood of new experimental data, the interpretation of which is hindered by a shortage of reliable annotations for proteins that lack experimental characterization or significant homologs in current databases. Here we introduce PFP, an automated function prediction server that provides the most probable annotations for a query sequence in each of the three branches of the Gene Ontology: biological process, molecular function, and cellular component. Rather than utilizing precise pattern matching to identify functional motifs in the sequences and structures of these proteins, we designed PFP to increase the coverage of function annotation by lowering resolution of predictions when a detailed function is not predictable. To do this we extend a traditional PSI-BLAST search by extracting and scoring annotations (GO terms) individually, including annotations from distantly related sequences, and applying a novel data mining tool, the Function Association Matrix, to score strongly associated pairs of annotations. We show that PFP can correctly assign function using only weakly similar sequences with a significantly better accuracy and coverage than a standard PSI-BLAST search, improving it more than fivefold. The most descriptive annotations predicted by PFP (GO depth ≥ 8) can identify a significant subgraph in the GO with >60% accuracy and approximately 100% coverage for our benchmark set. We also provide examples of the superb performance of PFP in an assessment of automated function prediction servers at the Automated Function Prediction Special Interest Group meeting at ISMB 2005 (AFP-SIG '05).
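The scoring idea in the abstract (each GO term scored individually across all PSI-BLAST hits, including weak ones, with extra credit passed to strongly associated terms) can be illustrated with a minimal sketch. The abstract does not state the scoring formula, so everything below is an assumption made for illustration: the weight `-log10(E) + shift`, the `shift` value, the function name `score_go_terms`, the layout of the association table, and the toy GO identifiers.

```python
import math
from collections import defaultdict

def score_go_terms(hits, assoc, shift=2.0):
    """Rank GO terms from sequence hits (illustrative sketch, not the published formula).

    hits  : list of (e_value, [GO terms annotated to that hit])
    assoc : {observed_term: {associated_term: weight in [0, 1]}},
            a hypothetical stand-in for the Function Association Matrix
    shift : assumed offset so that hits with E-value up to 10**shift still count
    """
    scores = defaultdict(float)
    for e_value, terms in hits:
        # Significant hits get large weights; weak hits get small positive ones.
        weight = -math.log10(max(e_value, 1e-180)) + shift
        if weight <= 0:
            continue  # hit is too weak to contribute under this assumed cutoff
        for term in terms:
            scores[term] += weight  # direct annotation of the hit
            for associated, p in assoc.get(term, {}).items():
                scores[associated] += weight * p  # contextual association
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy input: one strong hit and two weak hits that agree on GO:B.
hits = [(1e-3, ["GO:A"]), (5.0, ["GO:B"]), (20.0, ["GO:B", "GO:C"])]
assoc = {"GO:B": {"GO:D": 0.5}}
ranked = score_go_terms(hits, assoc)
for term, score in ranked:
    print(term, round(score, 3))
```

In this toy run, GO:D is never annotated to any hit, yet it receives a score through its assumed association with GO:B, and GO:B accumulates evidence from two hits that a top-hit transfer would ignore.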

Figures

**Figure 1.**
Sequence coverage of PFP versus top PSI-BLAST. The sequence coverage (Y-axis) is the percentage of sequences for which a correct biological process (sharing a common parent with a target annotation at a GO depth ≥ 4) was ranked in the top five results output by PFP. The E-value cutoff value (X-axis) represents the minimum similarity for sequences used by PFP in our benchmark analysis. PFP + FAM1000 (solid black line) is PFP with associations scored by the FAM1000 matrix. PFP (w/o FAM) (broken black line) is PFP without scored associations. Top PSI-BLAST (solid gray line) transfers annotations from the most similar sequence scoring above each E-value cutoff.
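The sequence coverage defined in this caption can be computed as in the following sketch. The function name, the data layout, and the exact-match `same_term` predicate are hypothetical placeholders; the caption's actual correctness test is a shared common parent at GO depth ≥ 4, which would be supplied as `is_correct`.

```python
def sequence_coverage(predictions, targets, is_correct, top_n=5):
    """Percentage of benchmark sequences with a correct term among the top_n predictions.

    predictions : {sequence_id: [GO terms, best first]}
    targets     : {sequence_id: set of known GO terms}
    is_correct  : callable(predicted_term, target_term) -> bool
    """
    if not targets:
        return 0.0
    covered = 0
    for seq_id, known_terms in targets.items():
        top = predictions.get(seq_id, [])[:top_n]
        if any(is_correct(p, t) for p in top for t in known_terms):
            covered += 1
    return 100.0 * covered / len(targets)

def same_term(predicted, target):
    # Placeholder criterion for the toy data below (exact match only).
    return predicted == target

targets = {"s1": {"GO:1"}, "s2": {"GO:2"}, "s3": {"GO:3"}, "s4": {"GO:4"}}
predictions = {
    "s1": ["GO:9", "GO:1"],                                  # correct at rank 2
    "s2": ["GO:5", "GO:6", "GO:7", "GO:8", "GO:9", "GO:2"],  # correct only at rank 6
    "s3": ["GO:3"],                                          # correct at rank 1
}                                                            # s4 has no predictions
coverage = sequence_coverage(predictions, targets, same_term)
print(coverage)
```

Sequences with no prediction at all stay in the denominator here, which is one reasonable reading of "percentage of sequences" in the caption.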

**Figure 2.**
Annotation-level accuracy of PFP at different GO depths. (A) The specificity (percentage of predicted annotations sharing a common parent with a target annotation at a GO depth ≥ 4) is shown at each GO depth of predicted annotation. The overprediction distance (dark gray columns, right-hand axis) is the average edge distance between a predicted annotation and the common parent it shares with the closest target annotation; the underprediction distance is the average edge distance between a target annotation and the common parent (light gray columns). The GO depth is the edge distance between each predicted (*top*) or target (*bottom*) annotation and the category root. (B) The coverage (percentage of correctly predicted target annotations) is shown at each GO depth of the correct annotation. The overprediction distance (dark gray columns, right-hand axis) is the average edge distance between each target annotation and the common parent it shares with the closest predicted annotation; the underprediction distance is the average edge distance between the closest predicted annotation and the common parent (light gray columns). For both A and B, “E-value Cutoff = 0” (solid black line) is PFP + FAM1000 and “E-value Cutoff = 15” (broken black line) is PFP + FAM1000 using only sequence hits from PSI-BLAST with an E-value of 15 or larger.
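The quantities in this caption (GO depth, the common parent of a predicted and a target annotation, and the two edge distances to it) can be made concrete on a toy ontology. This is a sketch under stated assumptions: the `PARENTS` graph is invented, depth is taken as the shortest path to the root, a term counts as its own ancestor at distance 0, and the deepest shared ancestor is used as the common parent. The caption does not specify these choices.

```python
from collections import deque

# Hypothetical toy ontology, child -> list of parents (not real GO terms).
PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["a"],
    "c": ["b"],
    "d": ["c"],   # depth 4
    "e": ["d"],   # depth 5
    "f": ["d"],   # depth 5
    "g": ["f"],   # depth 6
    "x": ["a"],   # depth 2
}

def ancestor_distances(term):
    """Shortest edge distance from term to each ancestor (the term itself is at 0)."""
    dist = {term: 0}
    queue = deque([term])
    while queue:
        node = queue.popleft()
        for parent in PARENTS[node]:
            if parent not in dist:
                dist[parent] = dist[node] + 1
                queue.append(parent)
    return dist

def depth(term, root="root"):
    """Edge distance between a term and the category root."""
    return ancestor_distances(term)[root]

def deepest_common_parent(predicted, target):
    """Return (common parent, distance from predicted, distance from target)."""
    up_pred = ancestor_distances(predicted)
    up_targ = ancestor_distances(target)
    shared = set(up_pred) & set(up_targ)
    parent = max(shared, key=depth)
    return parent, up_pred[parent], up_targ[parent]

def is_correct(predicted, target, min_depth=4):
    """Correct if the common parent lies at GO depth >= min_depth."""
    parent, _, _ = deepest_common_parent(predicted, target)
    return depth(parent) >= min_depth

parent, over, under = deepest_common_parent("e", "g")
print(parent, over, under, is_correct("e", "g"), is_correct("x", "g"))
```

For the pair (`e`, `g`) the common parent is `d` at depth 4, so the prediction counts as correct; in the sense of panel A the overprediction distance is 1 (predicted term to common parent) and the underprediction distance is 2 (target term to common parent). The pair (`x`, `g`) shares only `a` at depth 1 and is therefore incorrect.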
