The PhyloFacts FAT-CAT web server: ortholog identification and function prediction using fast approximate tree classification

Cyrus Afrasiabi¹, Bushra Samad, David Dineen, Christopher Meacham, Kimmen Sjölander

PMID: 23685612
PMCID: PMC3692063
DOI: 10.1093/nar/gkt399

Cyrus Afrasiabi et al. Nucleic Acids Res. 2013 Jul.

Nucleic Acids Res. 2013 Jul;41(Web Server issue):W242-8.

doi: 10.1093/nar/gkt399. Epub 2013 May 18.

Affiliation

¹ QB3 Institute, University of California, Berkeley, Berkeley, CA 94720-1762, USA.

Abstract

The PhyloFacts 'Fast Approximate Tree Classification' (FAT-CAT) web server provides a novel approach to ortholog identification using subtree hidden Markov model-based placement of protein sequences to phylogenomic orthology groups in the PhyloFacts database. Results on a data set of microbial, plant and animal proteins demonstrate FAT-CAT's high precision at separating orthologs and paralogs and robustness to promiscuous domains. We also present results documenting the precision of ortholog identification based on subtree hidden Markov model scoring. The FAT-CAT phylogenetic placement is used to derive a functional annotation for the query, including confidence scores and drill-down capabilities. PhyloFacts' broad taxonomic and functional coverage, with >7.3 M proteins from across the Tree of Life, enables FAT-CAT to predict orthologs and assign function for most sequence inputs. Four pipeline parameter presets are provided to handle different sequence types, including partial sequences and proteins containing promiscuous domains; users can also modify individual parameters. PhyloFacts trees matching the query can be viewed interactively online using the PhyloScope JavaScript tree viewer and are hyperlinked to various external databases. The FAT-CAT web server is available at http://phylogenomics.berkeley.edu/phylofacts/fatcat/.

Figures

**Figure 1.**
The FAT-CAT pipeline. The FAT-CAT pipeline starts with the submission of a protein sequence and parameter selection and proceeds through family and subtree HMM scoring to ortholog identification and functional annotation. The FAST-CAT variant differs from the default FAT-CAT pipeline in Stage 3 (indicated by red arrows). In Stage 1, the query is scored against family HMMs in the PhyloFacts database for proteins sharing the same multi-domain architecture (MDA) (shown at top) and HMMs constructed for Pfam domains (shown at bottom). Families meeting Stage 1 criteria (E-value and alignment statistics) are passed to Stage 2. In this toy example, PhyloFacts trees for two Pfam domains and a tree for the MDA meet Stage 1 criteria and are passed to Stage 2. In Stage 2, we obtain an approximate phylogenetic placement of the query in each tree by scoring all the HMMs in the tree. The subtree node corresponding to the top-scoring HMM is examined to determine its suitability as a source of orthologs to the query: Stage 2 parameters include the query-subtree HMM score and alignment statistics and whether the subtree appears to be restricted to orthologs. For each top-scoring node that meets these criteria, we identify a (typically larger) enclosing clade supported by one or more orthology methods. Enclosing clades are passed to Stage 3 for ortholog identification. In Stage 3, FAT-CAT and FAST-CAT diverge. FAT-CAT (blue arrows) evaluates the pairwise alignment between the query and each sequence and identifies all evidence supporting the orthology. FAST-CAT (red arrows) avoids much of this computational complexity by using a fast k-tuple comparison to select the most similar sequences from the enclosing clade, constructing a multiple sequence alignment (MSA) including the query using MAFFT, estimating a phylogenetic tree using FastTree, and extracting a subtree of the phylogenetically closest sequences (i.e. based on tree distance to the query). Alignment analysis can then be restricted to this smaller subset based on the MSA. Sequences meeting these criteria are then passed to Stage 4. In Stage 4, we derive a weighted consensus functional annotation for the query based on orthologs selected in Stage 3. Annotations from close orthologs are given higher weight than those from more distant orthologs, and manually curated annotations are given higher weight than those that are derived computationally.
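
The Stage 4 weighting described in the caption can be illustrated with a minimal Python sketch. It is not the FAT-CAT implementation: the caption says only that closer orthologs and manually curated annotations receive higher weight, so the weight values, the use of fractional identity as a stand-in for closeness, the function name `consensus_annotation`, and the example annotation labels are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical weights: the source states that curated annotations count
# for more than computational ones, but not by how much.
CURATED_WEIGHT = 2.0
COMPUTED_WEIGHT = 1.0


def consensus_annotation(orthologs):
    """Return (annotation, confidence) pairs ranked by weighted support.

    Each ortholog is a dict with keys:
      'identity'   - fractional identity to the query in (0, 1], used here
                     as a proxy for evolutionary closeness (an assumption)
      'annotation' - the functional label carried by that ortholog
      'curated'    - True if the label was manually curated
    """
    support = defaultdict(float)
    for ortholog in orthologs:
        evidence = CURATED_WEIGHT if ortholog["curated"] else COMPUTED_WEIGHT
        # Closer orthologs and curated labels both add more support.
        support[ortholog["annotation"]] += ortholog["identity"] * evidence

    total = sum(support.values())
    if total == 0:
        return []

    # Normalise so the confidences across all labels sum to 1.
    ranked = sorted(support.items(), key=lambda item: item[1], reverse=True)
    return [(label, weight / total) for label, weight in ranked]


# Made-up example data, not taken from the FAT-CAT results.
example_orthologs = [
    {"identity": 0.9, "annotation": "apoptosis regulator", "curated": True},
    {"identity": 0.5, "annotation": "apoptosis regulator", "curated": False},
    {"identity": 0.4, "annotation": "ATP binding", "curated": False},
]
```

Calling `consensus_annotation(example_orthologs)` ranks "apoptosis regulator" first because it is backed by both a close, curated ortholog and a more distant computational one. A real scoring scheme would likely use tree distance rather than pairwise identity to measure closeness.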

**Figure 2.**
Example FAT-CAT results. The query sequence (gi|344266516|ref|XP_003405326.1) is a predicted apoptotic protease-activating factor 1 from *Loxodonta africana* (African elephant). Top: the Summary of Results page, presenting an overview of results, including the Pfam MDA for the query produced by scanning Pfam-A HMMs. The FAT-CAT pipeline identified 274 families matching Stage 1 criteria and orthologs from nine different genomes (candidate ortholog clusters). Predicted functional annotations for the query derived from orthologs satisfying Stage 3 criteria are displayed. The Job Summary tab displays the input sequence and all pipeline parameters. Bottom left: Enclosing clades passing Stage 2 criteria, displaying matches along the entire MDA as well as to individual Pfam domains. Bottom right: Clicking on the tree icon in the data table in the Enclosing Clade tab displays the tree for an enclosing clade, highlighting the path from the root of the enclosing clade to the top-scoring node. The PhyloScope viewer allows users to view which sequences have experimental support for their annotations and provides links to external databases and to internal PhyloFacts pages. Results can be viewed online at http://phylogenomics.berkeley.edu/phylofacts/fatcat/2616/.
