PLoS Comput Biol. 2005 Oct;1(5):e45.
doi: 10.1371/journal.pcbi.0010045. Epub 2005 Oct 7.

Protein molecular function prediction by Bayesian phylogenomics

Barbara E Engelhardt et al. PLoS Comput Biol. 2005 Oct.

Abstract

We present a statistical graphical model to infer specific molecular function for unannotated protein sequences using homology. Based on phylogenomic principles, SIFTER (Statistical Inference of Function Through Evolutionary Relationships) accurately predicts molecular function for members of a protein family given a reconciled phylogeny and available function annotations, even when the data are sparse or noisy. Our method produced specific and consistent molecular function predictions across 100 Pfam families in comparison to the Gene Ontology annotation database, BLAST, GOtcha, and Orthostrapper. We performed a more detailed exploration of functional predictions on the adenosine-5'-monophosphate/adenosine deaminase family and the lactate/malate dehydrogenase family, in the former case comparing the predictions against a gold standard set of published functional characterizations. Given function annotations for 3% of the proteins in the deaminase family, SIFTER achieves 96% accuracy in predicting molecular function for experimentally characterized proteins as reported in the literature. The accuracy of SIFTER on this dataset is a significant improvement over other currently available methods such as BLAST (75%), GeneQuiz (64%), GOtcha (89%), and Orthostrapper (11%). We also experimentally characterized the adenosine deaminase from Plasmodium falciparum, confirming SIFTER's prediction. The results illustrate the predictive power of exploiting a statistical model of function evolution in phylogenomic problems. A software implementation of SIFTER is available from the authors.

Conflict of interest statement

Competing interests. The authors have declared that no competing interests exist.

Figures

Figure 1. Percentage of Proteins with Incorrect or Omitted Molecular Function Prediction of the AMP/Adenosine Deaminase Family, Assessed on a Gold Standard Test Set
Results for SIFTER, BLASTA (the most significant non-identity annotated sequence), BLASTB (the most significant non-identity sequence), GeneQuiz, GOtcha, GOtcha-exp (only experimental GO annotations used), Orthostrapper (significant clusters), and Orthostrapper-ns (non-significant clusters). The gold standard test set was manually compiled based on a literature search. All percentages are of true positives relative to the test set. (A) Results for discrimination between just the three deaminase substrates, as a percentage of the 28 possible correct functions. (B) Results for discrimination between the three deaminase substrates plus the additional growth factor domain, as a percentage of the 36 possible correct functions; for BLAST, GeneQuiz, Orthostrapper, and Orthostrapper-ns, we required the transferred annotation to contain both functions; for SIFTER, GOtcha, and GOtcha-exp we required that the two correct functions have the two highest ranking posterior probabilities or scores.
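The ranking criterion described for panel (B) can be sketched as a small check: a prediction counts as correct only if the correct functions occupy the top ranks of the posterior probabilities or scores. This is a minimal illustration, not the authors' evaluation code; the function name `top_k_matches` and the numeric values are hypothetical, and ties between scores are not handled specially.

```python
def top_k_matches(scores, correct):
    """Return True if the len(correct) highest-scoring functions are exactly `correct`.

    scores:  dict mapping a function label to its posterior probability or score
    correct: set of function labels known to be correct for the protein
    """
    k = len(correct)
    # Rank function labels from highest to lowest score.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k]) == set(correct)


# Made-up values for illustration only; they are not taken from the paper.
posteriors = {
    "growth factor": 0.52,
    "adenosine": 0.41,
    "AMP": 0.05,
    "adenine": 0.02,
}

both_correct = top_k_matches(posteriors, {"growth factor", "adenosine"})
```

With these example values the two correct functions hold the two highest ranks, so `both_correct` is `True`.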
Figure 2. Gene Ontology Hierarchy Section Representing the Functions Associated with the Three Substrate Specificities Found in the AMP/Adenosine Deaminase Pfam Family, and the Growth Factor Activity Associated with a Few Members of the Family
Double ovals represent the four functions, none of which are compatible, corresponding to the random variables associated with the random vector used for inference in SIFTER.
Figure 3. Results for Pruned Version of the AMP/Adenosine Deaminase Family
The reconciled phylogeny used in inference is shown, along with inferential results (both the posterior probabilities for the deaminase substrates and the function prediction based on the maximum posterior probability). Eight of the proteins in this tree were annotated with growth factor activity, with the second highest probability being adenosine deaminase. The function observations used for inference are denoted by filled boxes to the left of the column with the posterior probabilities. For each substrate specificity that arises, a single edge in the phylogeny identifies a possible location for that mutation. The highlighted sequences are discussed in the text. The blue vertices represent speciation events and the red vertices represent duplication events. The tree was rendered using ATV software, version 1.92 [68].
Figure 4. ROC Plots for the AMP/Adenosine Deaminase Family Functional Predictions from BLASTC, SIFTER, and SIFTER-N (Normalized)
These ROC curves were computed over the 28 proteins in the test set for the deaminase family. This figure presents the ROC plot for both the posterior probabilities produced by SIFTER (and normalized for SIFTER-N) and the E-value significance scores from BLASTC, where they are used to annotate proteins, selecting between deaminase substrates AMP, adenine, and adenosine. The false positive axis is scaled logarithmically to focus on true positive percentages when the percentage of false positives is low. FN, false negative; FP, false positive; TN, true negative; TP, true positive.
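As a rough sketch of how ROC points like those in this figure can be derived from a list of scores, the following computes (false positive rate, true positive rate) pairs by sweeping a threshold from the highest score down. It is a generic construction, not the authors' procedure; `roc_points` and the sample data are hypothetical. Scores are assumed to be "higher is better", so E-values would need to be negated or otherwise transformed first.

```python
def roc_points(scores, labels):
    """Return ROC points as (false_positive_rate, true_positive_rate) pairs.

    scores: list of numbers, higher meaning a more confident prediction
    labels: list of 0/1 values, 1 meaning the prediction is actually correct
    Assumes at least one positive and one negative label.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])

    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume every prediction tied at this threshold before emitting a point.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


# Made-up scores and labels for illustration only.
curve = roc_points([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

When plotting such points with a logarithmic false positive axis, as the figure does, points with a false positive rate of exactly zero need separate handling, since zero has no logarithm.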
Figure 5. The Dependence of the Rate of Deamination of Adenosine upon Substrate Concentration with 17 nM Q8IJA9_PLAFA
The open circles are individual data points, while the solid line is the fit of the data to Equation 1. The inset shows raw data for the deamination of three substrates by Q8IJA9_PLAFA as detected by loss of absorbance at 265 nm. The bold, thin, and dashed lines are data for 100 μM adenine, AMP, and adenosine, respectively. The reactions with adenine and AMP contained 860 nM enzyme, while the assay containing adenosine had only 17 nM enzyme. Reaction conditions for all assays were 25 °C in 50 mM potassium phosphate (pH 7.4).
Figure 6. A Depiction of a Fragment of a Phylogeny and the Noisy-OR Model
(A) Two proteins, Q9VFS0 and Q9VFS1, both from Drosophila melanogaster, related by a common ancestor protein. (B) Protein Q9VFS1 has a functional observation for adenosine deaminase (the center rectangle). Also shown are the posterior probabilities for each molecular function as grayscale (white indicating zero and black indicating one) of the annotation vector after inference. Each component of the vector corresponds to a particular deaminase substrate. (C) The noisy-OR model that underlies the inference procedure. We focus on the adenosine deaminase random variable in protein Q9VFS0. The transition probability for this random variable depends on all of the ancestor random variables and the transition parameters $q_{m,n}$.
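To make the noisy-OR idea concrete, here is a minimal sketch of the textbook noisy-OR combination: each ancestor function m that is present independently "switches on" descendant function n with probability q[m][n], and the descendant function is present if at least one ancestor succeeds. This is the generic form only. How SIFTER parameterizes the transition (for example, any dependence on branch length) is not stated in this caption, so the function `noisy_or` and the parameter values below are assumptions for illustration.

```python
from math import prod


def noisy_or(ancestor_states, q_column):
    """Probability that a descendant function is present under a generic noisy-OR.

    ancestor_states: list of 0/1 values, one per candidate function in the ancestor
    q_column:        list of probabilities q[m][n] for a fixed descendant function n,
                     one per ancestor function m
    """
    # The descendant function is absent only if every active ancestor fails to induce it.
    p_absent = prod((1.0 - q) ** x for x, q in zip(ancestor_states, q_column))
    return 1.0 - p_absent


# Hypothetical parameters for the descendant's "adenosine" component,
# indexed by ancestor function in the order: AMP, adenine, adenosine.
q_to_adenosine = [0.1, 0.2, 0.9]

p_kept = noisy_or([0, 0, 1], q_to_adenosine)      # ancestor has adenosine only
p_combined = noisy_or([1, 0, 1], q_to_adenosine)  # ancestor has AMP and adenosine
```

With these made-up parameters, an ancestor carrying only the adenosine function gives a probability of about 0.9, and adding the AMP function raises it to about 0.91, since 1 - (0.9 × 0.1) = 0.91.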
