Protein Sci. 2006 Jun;15(6):1550-6.
doi: 10.1110/ps.062153506. Epub 2006 May 2.

Enhanced automated function prediction using distantly related sequences and contextual association by PFP

Troy Hawkins et al. Protein Sci. 2006 Jun.

Abstract

Automated function prediction methods have emerged in response to an exponentially growing flood of new experimental data, the interpretation of which is hindered by a shortage of reliable annotations for proteins that lack experimental characterization or significant homologs in current databases. Here we introduce PFP, an automated function prediction server that provides the most probable annotations for a query sequence in each of the three branches of the Gene Ontology (GO): biological process, molecular function, and cellular component. Rather than using precise pattern matching to identify functional motifs in the sequences and structures of these proteins, we designed PFP to increase the coverage of function annotation by lowering the resolution of predictions when a detailed function is not predictable. To do this, we extend a traditional PSI-BLAST search by extracting and scoring annotations (GO terms) individually, including annotations from distantly related sequences, and by applying a novel data mining tool, the Function Association Matrix (FAM), to score strongly associated pairs of annotations. We show that PFP can correctly assign function using only weakly similar sequences, with significantly better accuracy and coverage than a standard PSI-BLAST search, an improvement of more than fivefold. The most descriptive annotations predicted by PFP (GO depth ≥ 8) can identify a significant subgraph in the GO with >60% accuracy and approximately 100% coverage for our benchmark set. We also provide examples of PFP's strong performance in an assessment of automated function prediction servers at the Automated Function Prediction Special Interest Group meeting at ISMB 2005 (AFP-SIG '05).
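The scoring idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the scoring formula, so everything specific here is an assumption made for illustration: the per-hit weight `-log10(E) + shift`, the `shift` value, the layout of the association table (`assoc[observed][predicted]` holding a conditional probability), and the placeholder term names. It is not presented as the PFP implementation.

```python
import math
from collections import defaultdict


def score_go_terms(hits, assoc, shift=2.0):
    """Rank GO terms from a list of sequence hits.

    hits:  list of (e_value, set_of_terms) pairs, one per database hit.
    assoc: hypothetical association table, assoc[observed][predicted] = P(predicted | observed).
    shift: assumed offset that lets weak hits (E-value up to 10**shift) still contribute.
    """
    scores = defaultdict(float)
    for e_value, terms in hits:
        # Floor the E-value so that a reported E-value of 0 does not break the logarithm.
        weight = -math.log10(max(e_value, 1e-300)) + shift
        if weight <= 0:
            continue  # hit is too weak to contribute under this assumed weighting
        for observed in terms:
            # The annotation itself receives the full weight of the hit.
            scores[observed] += weight
            # Associated annotations receive the weight scaled by their association strength.
            for predicted, prob in assoc.get(observed, {}).items():
                if predicted != observed:
                    scores[predicted] += weight * prob
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy input: one moderately similar hit and one weak hit (placeholder term names).
hits = [(1e-3, {"term_A"}), (10.0, {"term_A", "term_B"})]
assoc = {"term_B": {"term_C": 0.5}}
ranked = score_go_terms(hits, assoc)
```

In this toy input, `term_A` is supported by both hits and ranks first, while `term_C` never appears in any hit and is scored only through its assumed association with `term_B`. This mirrors how contextual association can surface annotations that a direct transfer from the top hit would miss.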

Figures

Figure 1.
Sequence coverage of PFP versus top PSI-BLAST. The sequence coverage (Y-axis) is the percentage of sequences for which a correct biological process (sharing a common parent with a target annotation at a GO depth ≥ 4) was ranked in the top five results output by PFP. The E-value cutoff value (X-axis) represents the minimum similarity for sequences used by PFP in our benchmark analysis. PFP + FAM1000 (solid black line) is PFP with associations scored by the FAM1000 matrix. PFP (w/o FAM) (broken black line) is PFP without scored associations. Top PSI-BLAST (solid gray line) transfers annotations from the most similar sequence scoring above each E-value cutoff.
Figure 2.
Annotation-level accuracy of PFP at different GO depths. (A) The specificity (percentage of predicted annotations sharing a common parent with a target annotation at a GO depth ≥ 4) is shown at each GO depth of predicted annotation. The overprediction distance (dark gray columns, right-hand axis) is the average edge distance between a predicted annotation and the common parent it shares with the closest target annotation; the underprediction distance is the average edge distance between a target annotation and the common parent (light gray columns). The GO depth is the edge distance between each predicted (top) or target (bottom) annotation and the category root. (B) The coverage (percentage of correctly predicted target annotations) is shown at each GO depth of the correct annotation. The overprediction distance (dark gray columns, right-hand axis) is the average edge distance between each target annotation and the common parent it shares with the closest predicted annotation; the underprediction distance is the average edge distance between the closest predicted annotation and the common parent (light gray columns). For both A and B, “E-value Cutoff = 0” (solid black line) is PFP + FAM1000 and “E-value Cutoff = 15” (broken black line) is PFP + FAM1000 using only sequence hits from PSI-BLAST with an E-value of 15 or larger.
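The correctness criterion used in both captions (a predicted annotation counts as correct when it shares a common parent with a target annotation at GO depth ≥ 4) and the top-five sequence coverage of Figure 1 can be sketched as follows. The toy ontology, the function names, the choice of shortest path to the root as the depth, and counting a term as its own ancestor are all assumptions made for illustration; the captions do not specify these details.

```python
def ancestors(term, parents):
    """Return the term together with all of its ancestors in the ontology."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def depth(term, parents):
    """Edge distance to the root, taken here as the shortest path (an assumption)."""
    term_parents = parents.get(term, ())
    if not term_parents:
        return 0
    return 1 + min(depth(p, parents) for p in term_parents)


def is_correct(predicted, targets, parents, min_depth=4):
    """True if the prediction shares a common parent at depth >= min_depth with any target."""
    predicted_ancestors = ancestors(predicted, parents)
    return any(
        depth(common, parents) >= min_depth
        for target in targets
        for common in predicted_ancestors & ancestors(target, parents)
    )


def sequence_coverage(predictions, targets, parents, top_n=5, min_depth=4):
    """Percentage of sequences with at least one correct term among the top_n predictions."""
    covered = sum(
        any(is_correct(p, targets[seq], parents, min_depth) for p in ranked[:top_n])
        for seq, ranked in predictions.items()
    )
    return 100.0 * covered / len(predictions)


# Hypothetical ontology: a chain root > a1 > a2 > a3 > a4 with two leaves, plus a shallow branch b1.
parents = {
    "a1": ["root"], "a2": ["a1"], "a3": ["a2"], "a4": ["a3"],
    "a5x": ["a4"], "a5y": ["a4"], "b1": ["root"],
}
predictions = {"seq1": ["b1", "a5x"], "seq2": ["b1"]}
targets = {"seq1": {"a5y"}, "seq2": {"a5y"}}
coverage = sequence_coverage(predictions, targets, parents)
```

Here `a5x` and `a5y` share the parent `a4` at depth 4, so `seq1` counts as covered, while `b1` shares only the root with the target, so `seq2` does not. The resulting coverage for this toy set is 50%.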

