BMC Bioinformatics. 2011 Nov 9:12:434.
doi: 10.1186/1471-2105-12-434.

ProPhylo: partial phylogenetic profiling to guide protein family construction and assignment of biological process

Malay K Basu et al. BMC Bioinformatics. 2011.

Abstract

Background: Phylogenetic profiling scores the co-occurrence of a protein family with some other trait, usually another protein family, across a set of taxonomic groups. Despite several refinements in recent years, the technique still leaves significant room for improvement. To be most effective, a phylogenetic profiling algorithm must be able to examine co-occurrences among protein families whose boundaries are uncertain within large homologous protein superfamilies.

Results: Partial Phylogenetic Profiling (PPP) is an iterative algorithm that scores a given taxonomic profile against the taxonomic distribution of families for all proteins in a genome. The method works by optimizing the boundary of each protein family, rather than by relying on prebuilt protein families or fixed sequence similarity thresholds. Double Partial Phylogenetic Profiling (DPPP) is a related procedure that begins with a single sequence and searches for optimal granularities for its surrounding protein family in order to generate the best query profiles for PPP. We present ProPhylo, a high-performance software package that performs phylogenetic profiling studies by creating individually optimized protein family boundaries. ProPhylo provides precomputed databases for immediate use and tools for manipulating the taxonomic profiles used as queries.
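
The boundary optimization described for PPP can be illustrated with a minimal sketch. It is not ProPhylo's code: it assumes that each target protein comes with a list of the genomes of its homologs, ordered from most to least similar, and that each candidate cutoff is scored by the negative log10 of a binomial tail probability. The function names, the toy genome IDs, and the value of p are all hypothetical.

```python
from math import comb, log10

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    if k <= 0:
        return 1.0
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def best_cutoff(hit_genomes, profile, p):
    """Scan every depth of a similarity-ordered hit list; return (score, depth).

    hit_genomes: genome IDs of the protein's homologs, best hit first
    profile: set of genome IDs marked 1 in the query profile
    p: assumed chance that a random genome matches the profile
    """
    best = (0.0, 0)
    seen = set()
    matches = 0
    for genome in hit_genomes:
        if genome in seen:              # count each genome only once
            continue
        seen.add(genome)
        matches += genome in profile    # True counts as 1
        score = -log10(binomial_tail(matches, len(seen), p))
        if score > best[0]:
            best = (score, len(seen))
    return best

# Toy data: the four closest homologs fall in profile genomes, the rest do not.
toy_profile = {"g1", "g2", "g3", "g4"}
toy_hits = ["g1", "g2", "g3", "g4", "x1", "x2"]
score, depth = best_cutoff(toy_hits, toy_profile, p=0.2)
print(depth, round(score, 3))
```

In this toy case the score peaks at depth 4 and falls once the non-profile genomes x1 and x2 are included, so the family boundary is drawn after the fourth genome rather than at a fixed similarity threshold.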

Conclusion: ProPhylo results show universal markers of methanogenesis, a new DNA phosphorothioation-dependent restriction enzyme, and efficacy in guiding protein family construction. The software and the associated databases are freely available under the open source Perl Artistic License from ftp://ftp.jcvi.org/pub/data/ppp/.

Figures

Figure 1
Flowchart for profile search using ProPhylo with the Partial Phylogenetic Profiling (PPP) algorithm. For each step, the relevant software name is indicated. (A) Creation of a profile from various search methods or directly from a set of GenBank GIs. (B) The created profile is a tab-delimited text file containing a set of taxonomic IDs from the NCBI taxonomic database, with 1's and 0's marking the presence and absence of the query protein family. (C) The main script, ppp.pl, searches a given genome using the query profile and generates results as a ranked list of candidate functionally linked proteins.
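
A profile of the kind described in (B) could be read as in the sketch below. The two-column layout (taxonomy ID, then 0 or 1) is an assumption based on the caption; the exact column order and any extra fields expected by ppp.pl are not stated here, and the taxonomy IDs in the example are placeholders.

```python
import io

def read_profile(handle):
    """Parse lines of '<taxon_id><TAB><0 or 1>' into {taxon_id: present}."""
    profile = {}
    for line_no, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:                    # ignore blank lines
            continue
        fields = line.split("\t")
        if len(fields) != 2 or fields[1] not in ("0", "1"):
            raise ValueError(f"line {line_no}: expected '<taxon_id><TAB><0|1>'")
        profile[int(fields[0])] = fields[1] == "1"
    return profile

# Placeholder taxonomy IDs, not real organisms.
example = io.StringIO("1001\t1\n1002\t0\n1003\t1\n")
profile = read_profile(example)
present = sorted(taxon for taxon, flag in profile.items() if flag)
print(present)
```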
Figure 2
Distribution of the scores of PPP. The query profile contains all methanogens marked as 1, and the target genome is Methanohalophilus mahii DSM 5219. The binomial distribution probability parameter is raised to 0.2, which helps proteins absolutely restricted to the methanogens, although not universal among them, to get a better relative rank. The plot shows PPP score on the Y-axis, and rank, sorted by PPP score, on the X-axis. The top-scoring 28, with perfect agreement to the query profile, are colored red.
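
The effect of the probability parameter can be illustrated with made-up numbers. The sketch assumes the score is the negative log10 of a binomial tail probability; the two proteins, their counts, and the lower comparison value of 0.03 are hypothetical and are not taken from the figure.

```python
from math import comb, log10

def ppp_like_score(matches, genomes_hit, p):
    """-log10 of P(X >= matches) for X ~ Binomial(genomes_hit, p)."""
    tail = sum(comb(genomes_hit, i) * p**i * (1 - p)**(genomes_hit - i)
               for i in range(matches, genomes_hit + 1))
    return -log10(tail)

# Hypothetical proteins scored against a profile with 30 genomes marked 1.
restricted = (20, 20)   # found in 20 genomes, all of them in the profile
broad = (30, 40)        # found in 40 genomes, 30 of them in the profile

for p in (0.03, 0.2):
    s_restricted = ppp_like_score(*restricted, p)
    s_broad = ppp_like_score(*broad, p)
    print(f"p={p}: restricted={s_restricted:.2f}, broad={s_broad:.2f}")
```

With these counts the restricted protein scores below the broad one at p = 0.03 and above it at p = 0.2, which matches the behavior the caption describes: a larger p penalizes off-profile genomes more heavily relative to the reward for each extra match.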
Figure 3
Flowchart of profile search using ProPhylo with Double Partial Phylogenetic Profiling (DPPP). (A) The search begins with a single query sequence and its BLAST hits. (B) The program then generates a different query profile for each depth in the BLAST hit list. (C) Each of these profiles is then searched against the target genome. (D) The top hits from each of these searches are then collected, and the output is presented sorted by significance.
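
Steps (B) through (D) amount to a loop over prefixes of the hit list, sketched below under stated assumptions. Depth is counted here in distinct genomes, the per-profile search is a caller-supplied function standing in for a PPP run, and the names profiles_by_depth, dppp_sketch, step, and top_n are hypothetical rather than ProPhylo options.

```python
def profiles_by_depth(hit_genomes, step=10):
    """Yield (depth, genome set) for growing prefixes of a BLAST hit list.

    hit_genomes: genome ID of each hit, best hit first (duplicates allowed)
    step: emit a profile every `step` distinct genomes
    """
    seen = []
    for genome in hit_genomes:
        if genome not in seen:
            seen.append(genome)
            if len(seen) % step == 0:
                yield len(seen), set(seen)

def dppp_sketch(hit_genomes, search, step=10, top_n=5):
    """Run `search(profile)` at each depth and pool the best hits.

    search: function returning [(protein_id, score), ...] for one profile;
            it stands in for a PPP search of the target genome.
    """
    pooled = []
    for depth, profile in profiles_by_depth(hit_genomes, step):
        ranked = sorted(search(profile), key=lambda hit: hit[1], reverse=True)
        pooled.extend((score, protein, depth) for protein, score in ranked[:top_n])
    return sorted(pooled, reverse=True)     # highest score first

def toy_search(profile):
    # Stand-in scorer: number of profile genomes in which each protein occurs.
    occurrences = {"protA": {"g1", "g2", "g3", "g4"}, "protB": {"g1", "g9"}}
    return [(name, len(genomes & profile)) for name, genomes in occurrences.items()]

hits = ["g1", "g2", "g2", "g3", "g4", "g5", "g6"]
results = dppp_sketch(hits, toy_search, step=2, top_n=1)
for score, protein, depth in results:
    print(protein, depth, score)
```

Keeping the depth alongside each pooled hit makes it possible to see at which family granularity a candidate protein agrees best with the query.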
Figure 4
Double partial phylogenetic profiling (DPPP) using the GTP-binding protein HydF (GI:113971588) as query. For the query protein (red), the curve rises monotonically, because it measures the correlation of the list of species it generates to itself. Among all proteins other than the query, the peak score for any protein (HydE, GI:113971587) occurs where the depth in the query protein's BLAST hit list is about 210. DPPP scores are shown for query protein depths 10 to 930, sampled every tenth hit, for the ten proteins that scored best at depth 210. The curves for HydE (GI:113971587, olive), HydG (GI:113971585, green), and the hydrogenase large (GI:113971582, blue) and small (GI:113971583, purple) subunits all peak at this query protein depth, which largely exhausts the list of species that carry the [FeFe] hydrogenase maturation system (see text and Table 3 for details). Proteins unrelated to [FeFe] hydrogenase maturation are shown in gray.
Figure 5
Distribution of PPP scores resulting from a GI:113971588-based query profile at a depth of 200 distinct genomes (Shewanella sp. MR-4). Green: the query gene itself and the four correlated hydrogenase and hydrogenase maturation genes listed in Table 3. Yellow: the ThiH gene.
