Comparative Study
BMC Evol Biol. 2007 Feb 8;7 Suppl 1(Suppl 1):S12.
doi: 10.1186/1471-2148-7-S1-S12.

FlowerPower: clustering proteins into domain architecture classes for phylogenomic inference of protein function

Nandini Krishnamurthy et al. BMC Evol Biol.

Abstract

Background: Function prediction by transfer of annotation from the top database hit in a homology search has been shown to be prone to systematic error. Phylogenomic analysis reduces these errors by inferring protein function within the evolutionary context of the entire family. However, accuracy of function prediction for multi-domain proteins depends on all members having the same overall domain structure. By contrast, most common homolog detection methods are optimized for retrieving local homologs, and do not address this requirement.

Results: We present FlowerPower, a novel clustering algorithm designed for the identification of global homologs as a precursor to structural phylogenomic analysis. Like PSI-BLAST and similar methods, FlowerPower clusters sequences iteratively. However, rather than using a single HMM or profile to expand the cluster, FlowerPower identifies subfamilies using the SCI-PHY algorithm and then selects and aligns new homologs using subfamily hidden Markov models. FlowerPower is shown to outperform BLAST, PSI-BLAST and the UCSC SAM-Target 2K methods at discriminating between proteins in the same domain architecture class and those having different overall domain structures.

Conclusion: Structural phylogenomic analysis enables biologists to avoid the systematic errors associated with annotation transfer; clustering sequences based on sharing the same domain architecture is a critical first step in this process. FlowerPower is shown to consistently identify homologous sequences having the same domain architecture as the query.

Availability: FlowerPower is available as a webserver at http://phylogenomics.berkeley.edu/flowerpower/.


Figures

Figure 1
Comparison of the performance of BLAST, PSI-BLAST, T2K and FlowerPower in identifying global homologs. The X-axis shows the average sensitivity (or recall) of each method across the dataset and the Y-axis shows the average precision (or selectivity). Sensitivity is the fraction of the target homolog set identified by a method, i.e., TP/(TP+FN). Precision is the fraction of the set selected by a method that belongs to the same domain architecture class, i.e., TP/(TP+FP). Results of FlowerPower at varying parameterizations are presented, including percent identity cutoffs for sequence selection (25%, 20% and 15%) and stringent ("str") and relaxed ("rel") query and hit coverage cutoffs, based on sequence length. The BLAST parameters refer to e-value cutoffs of 10^-20, 10^-10 and 10^-5. For PSI-BLAST the e-value cutoffs used were 10^-10, 10^-5 and 10^-3, with three iterations. T2K was run using default parameters. The inset displays FlowerPower results using different parameterizations.
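The two measures defined in this caption can be illustrated with a minimal sketch. The function name and the set-based representation of hits are assumptions made for illustration; they are not taken from the paper's evaluation code.

```python
def sensitivity_precision(selected, targets):
    """Return (sensitivity, precision) for one query.

    selected: identifiers retrieved by a method (hypothetical input format).
    targets:  identifiers of true global homologs, i.e. sequences in the
              same domain architecture class as the query.
    """
    selected, targets = set(selected), set(targets)
    tp = len(selected & targets)   # retrieved and correct
    fn = len(targets - selected)   # correct but missed
    fp = len(selected - targets)   # retrieved but in a different class
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, precision


if __name__ == "__main__":
    hits = {"a", "b", "c", "x"}
    truth = {"a", "b", "c", "d", "e"}
    print(sensitivity_precision(hits, truth))
```

Averaging these per-query values over all seed sequences would give one point per method and parameter setting on a plot of this kind.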
Figure 2
Analysis of rice protein XP_478746 and BLAST hits. The PFAM domain architectures of XP_478746, annotated as a "TIR/P-loop/LRR disease resistance protein-like protien" (sic) from Oryza sativa, and its closest BLAST homologs are shown. XP_478746 contains a TIR domain only with no room for the P-loop and LRR regions; the sequence is therefore misannotated. We expect XP_478746 was annotated based on local similarity to a sequence such as AAM28917 (annotated as a "putative TIR/NBS/LRR disease resistance protein" from Pinus taeda), which does contain the NB-ARC and LRR domains. See text for details.
Figure 3
Human sphingomyelinase or bacterial isochorismate synthase? The sequence AAF19052 is reported to be a human neutral sphingomyelinase, containing a DEATH domain at the C-terminus. The top panel shows the PFAM domain architecture, which reveals the presence of a chorismate-binding domain at the C-terminus and the absence of a DEATH domain. The lower panel displays a structural phylogenomic analysis, resulting from clustering the sequence homologs to AAF19052 using FlowerPower and construction of a Maximum Parsimony tree. Examination of the phylogenetic tree suggests that AAF19052 (red box) is more likely an isochorismate synthase of bacterial origin. Each node in the tree is labeled with the species of origin, SCI-PHY subfamily label (see Methods), and sequence identifier and definition line.
Figure 4
The FlowerPower algorithm. "Q" indicates the query (or seed) sequence. Sequences sharing the same domain structure are indicated as blue stars; all other sequences are indicated as brown triangles. SCI-PHY subfamilies are indicated by black ovals.
1. Identify a set of potential homologs S using PSI-BLAST; filter to remove much longer or much shorter sequences.
2. Select a core set for initial alignment.
3. Identify subfamilies using SCI-PHY and construct subfamily HMMs (SHMMs).
4. Score S with the SHMMs, and identify those sequences receiving scores with E-values below cutoff. Align each sequence to its closest SHMM. Evaluate the alignment with user-specified criteria; remove sequences that do not meet these criteria.
5. Run SCI-PHY on the new alignment to identify subfamilies and construct SHMMs.
6. Repeat steps 1–5 until convergence.
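The control flow of these steps can be sketched as a toy loop. This is not the FlowerPower implementation: PSI-BLAST, SCI-PHY and subfamily HMMs are replaced by crude stand-ins (a length-ratio filter and a string similarity from Python's difflib), and every function name, threshold and sequence below is a hypothetical chosen only to show how a cluster can grow by scoring candidates against several member-derived models rather than a single profile.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in for an SHMM score: plain string similarity in [0, 1].
    return SequenceMatcher(None, a, b).ratio()


def length_compatible(query, candidate, min_ratio=0.8):
    # Stand-in for the length filter of step 1: reject much longer
    # or much shorter sequences.
    shorter, longer = sorted((len(query), len(candidate)))
    return shorter / longer >= min_ratio


def iterative_cluster(query, database, threshold=0.85,
                      min_ratio=0.8, max_iter=10):
    # Step 1: candidate pool S, filtered by length relative to the query.
    pool = [s for s in database
            if s != query and length_compatible(query, s, min_ratio)]
    # Step 2: core set (here simply the query itself).
    cluster = [query]
    for _ in range(max_iter):
        # Steps 3-4: each current member acts as its own "subfamily model";
        # a candidate is admitted if its best score against any model passes.
        admitted = [s for s in pool if s not in cluster
                    and max(similarity(s, m) for m in cluster) >= threshold]
        if not admitted:
            break  # Step 6: converged, no new sequences were admitted.
        # Step 5: the enlarged cluster defines the models for the next round.
        cluster.extend(admitted)
    return cluster


if __name__ == "__main__":
    db = ["MKTAYIAKQK", "MKTAYLAKQK",
          "MKTAYIAKQRGGGGGGGGGGGGGGGG", "WWWWWWWWWW"]
    print(iterative_cluster("MKTAYIAKQR", db))
```

In this toy run the second database sequence is too distant from the query to be admitted directly and only joins once the first one is in the cluster, which is the behavior that scoring against multiple subfamily models is meant to provide. The much longer sequence is excluded by the length filter even though it contains the query as a local match.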
Figure 5
PFAM domain architecture of the seed dataset used to evaluate BLAST, PSI-BLAST, T2K and FlowerPower. Sequences were selected based on the following criteria: each sequence had to contain recognizable PFAM domains (based on the PFAM gathering threshold), no undefined regions of >80 amino acids (i.e., a region with no PFAM match), and each PFAM domain was required to match a 3D structure classified by the SCOP database. For details of the seed sequence domain architectures see Table 1.
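The "no undefined regions of >80 amino acids" criterion can be checked with a short sketch, assuming domain matches are given as 1-based inclusive (start, end) coordinates. The function names and input format are hypothetical and are not taken from the paper.

```python
def longest_unannotated_region(seq_len, domains):
    """Length of the longest stretch not covered by any domain match.

    seq_len: sequence length in amino acids.
    domains: list of (start, end) pairs, 1-based and inclusive
             (an assumed input format).
    """
    longest = 0
    covered_to = 0  # last residue covered so far
    for start, end in sorted(domains):
        # Gap between the covered prefix and this domain (negative if
        # the domains overlap, in which case it is ignored by max).
        longest = max(longest, start - covered_to - 1)
        covered_to = max(covered_to, end)
    # Unannotated tail after the last domain.
    return max(longest, seq_len - covered_to)


def passes_gap_criterion(seq_len, domains, max_gap=80):
    # A sequence qualifies if no unannotated region exceeds max_gap residues.
    return longest_unannotated_region(seq_len, domains) <= max_gap


if __name__ == "__main__":
    print(passes_gap_criterion(300, [(10, 120), (150, 290)]))
    print(passes_gap_criterion(400, [(1, 100), (200, 400)]))
```

The other two criteria (matches above the PFAM gathering threshold and a SCOP-classified structure for each domain) depend on external databases and are not modeled here.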
