Review. J R Soc Interface. 2008 Feb 6;5(19):151-70.
doi: 10.1098/rsif.2007.1047.

Practical and theoretical advances in predicting the function of a protein by its phylogenetic distribution

Philip R Kensche et al. J R Soc Interface. 2008.

Abstract

The gap between the amount of genome information released by genome sequencing projects and our knowledge about the proteins' functions is rapidly increasing. To fill this gap, various 'genomic-context' methods have been proposed that exploit sequenced genomes to predict the functions of the encoded proteins. One class of methods, phylogenetic profiling, predicts protein function by correlating the phylogenetic distribution of genes with that of other genes or phenotypic characteristics. The functions of a number of proteins, including ones of medical relevance, have thus been predicted and subsequently confirmed experimentally. Additionally, various approaches to measure the similarity of phylogenetic profiles and to account for the phylogenetic bias in the data have been proposed. We review the successful applications of phylogenetic profiling and analyse the performance of various profile similarity measures with a set of one microsporidial and 25 fungal genomes. In the fungi, phylogenetic profiling yields high-confidence predictions for the highest, and only the highest, scoring gene pairs, illustrating both the power and the limitations of the approach. Both practical examples and theoretical considerations suggest that in order to get a reliable and specific picture of a protein's function, results from phylogenetic profiling have to be combined with other sources of evidence.

Figures

Figure 1
Genotype/phenotype profiling as exemplified by the study of the eukaryotic flagellum by Li et al. (2004).
Figure 2
Overview of published phylogenetic profiling methods and related matrix-similarity methods. The ‘arity’ is the number of genes between which a functional relation is predicted. ‘Localized’ methods require only a coevolution of local region on two proteins to predict a functional link.
Figure 3
Negative influence of non-independence on some naive profiling methods. (a) (i) Two orthologous groups A and B occur in half of the species and have identical patterns of gains. Both Hamming distance (dH) and mutual information (MI) indicate high similarity. Although two additional co-losses in (ii) represent stronger evidence for a functional relation, the mutual information score is lower than in the previous situation (i). (b) (i) A and B are gained and lost independently but Hamming distance suggests high similarity (false positive). (ii) A single independent loss of B early in the phylogeny leads to high Hamming distance (false negative), despite two co-losses. In contrast, with differential Dollo parsimony (dP; see §8.4.3) the example of dependent evolution (ii) would correctly result in a better score than the example of independent evolution (i). dH, Hamming distance; MI, mutual information; dP, differential Dollo parsimony.
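
The two naive similarity measures named in the Figure 3 caption can be sketched on binary presence/absence profiles. This is a minimal illustration, not the authors' implementation: the function names and the toy profiles are assumptions made here, and the profiles are not the data shown in the figure. It only reproduces the qualitative point of panel (a), that two identical profiles with more shared absences can receive a lower mutual information score than identical profiles present in exactly half of the species.

```python
import math
from collections import Counter


def hamming_distance(profile_a, profile_b):
    """Number of species in which the two presence/absence profiles differ."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same species")
    return sum(x != y for x, y in zip(profile_a, profile_b))


def mutual_information(profile_a, profile_b):
    """Mutual information (in bits) between two profiles.

    Each species is treated as an independent observation, which is
    exactly the naive assumption the caption criticises.
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same species")
    n = len(profile_a)
    joint = Counter(zip(profile_a, profile_b))
    marginal_a = Counter(profile_a)
    marginal_b = Counter(profile_b)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        p_x = marginal_a[x] / n
        p_y = marginal_b[y] / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi


# Hypothetical toy profiles over eight species (1 = present, 0 = absent).
present_in_half = [1, 1, 1, 1, 0, 0, 0, 0]
two_more_losses = [1, 1, 0, 0, 0, 0, 0, 0]

mi_half = mutual_information(present_in_half, present_in_half)
mi_more_losses = mutual_information(two_more_losses, two_more_losses)
```

In both toy cases the Hamming distance between the identical profiles is zero, while the mutual information falls from 1 bit to roughly 0.81 bits once the profiles are present in only a quarter of the species.
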
Figure 4
Tree-guided approach implemented in the String database (von Mering et al. 2003a). A subtree is collapsed only if all its leaves have the same presence/absence pattern, i.e. if the ancestral state at the subtree's root is known with high certainty.
Figure 5
(a) Phylogenetic profiling has low overall performance (b) but makes highly reliable predictions for the highest scoring 5% of orthologous group pairs. Plotted are the bootstrap medians of AUC and PPV0.05 estimated from a bootstrap sample (n=100) of positive and negative controls. (Results are based on a set of orthologous groups for 25 fungi and the microsporidium E. cuniculi (figure 1 of the electronic supplementary material) and functional associations for S. cerevisiae from the MIPS (Mewes et al. 2006) and KEGG (Kanehisa et al. 2004) databases. In the bootstrapping procedure, each MIPS complex (KEGG pathway) was given the same weight to account for the overrepresentation of some functional categories in the MIPS dataset, such as the large ribosomal subunit. The bootstraps for the PPV estimates were done such that positive and negative controls were sampled with the same probability, so as to produce an average P/(P+N) ratio of 0.5. Box plots of the ‘weighted’ as well as a ‘normal’, i.e. non-weighted, bootstrap distributions are shown in figure 2 of the electronic supplementary material.) Dashed line, performance of the random classifier; AUC, area under ROC curve; PPV0.05, positive predictive value, i.e. fraction of true positives among the 5% highest scoring predictions. Open circles, MIPS; open squares, KEGG.
Figure 6
The positive predictive value drops quickly with increasing rate of positive predictions of the full MIPS dataset. The KEGG dataset produced similar results (data not shown). The dashed horizontal line indicates the performance of a random classifier, i.e. P/(P+N).
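
The evaluation quantities named in the captions of Figures 5 and 6 can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' evaluation code: the function names, the scores and the labels are all hypothetical, and the weighted bootstrap described in the Figure 5 caption is not reproduced. The sketch computes the positive predictive value among the top-scoring fraction of pairs, the random-classifier baseline P/(P+N), and the area under the ROC curve by direct pairwise comparison.

```python
def ppv_at_top_fraction(scores, labels, fraction=0.05):
    """Fraction of true positives among the highest scoring predictions."""
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and equal in length")
    k = max(1, int(len(scores) * fraction))  # always keep at least one prediction
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(1 for _, label in ranked[:k] if label) / k


def random_baseline(labels):
    """Expected PPV of a random classifier, i.e. P / (P + N)."""
    return sum(1 for label in labels if label) / len(labels)


def auc(scores, labels):
    """Area under the ROC curve as the probability that a positive pair
    outscores a negative pair (ties count as one half)."""
    positives = [s for s, label in zip(scores, labels) if label]
    negatives = [s for s, label in zip(scores, labels) if not label]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative control")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical similarity scores for ten gene pairs; 1 marks a known
# functional association (positive control), 0 a negative control.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

ppv_top20 = ppv_at_top_fraction(scores, labels, fraction=0.2)
ppv_top50 = ppv_at_top_fraction(scores, labels, fraction=0.5)
baseline = random_baseline(labels)
area = auc(scores, labels)
```

On this toy data the PPV is higher for the top 20% of pairs than for the top 50%, and both exceed the random baseline, which mirrors the qualitative trend the Figure 6 caption describes.
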
