BMC Bioinformatics. 2006 Oct 23:7:466.
doi: 10.1186/1471-2105-7-466.

ProFAT: a web-based tool for the functional annotation of protein sequences

Charles Richard Bradshaw et al. BMC Bioinformatics. 2006.

Abstract

Background: The functional annotation of proteins relies on published information concerning their close and remote homologues in sequence databases. Evidence for remote sequence similarity can be further strengthened when the query sequence and the identified database sequences share a similar biological background. However, few tools so far provide a means to include functional information in sequence database searches.

Results: We present ProFAT, a web-based tool for the functional annotation of protein sequences based on remote sequence similarity. ProFAT combines sensitive sequence database search methods and a fold recognition algorithm with a simple text-mining approach. ProFAT extracts identified hits based on their biological background by keyword-mining of the annotations, features and, most importantly, literature associated with a sequence entry. A user-provided keyword list enables the user to search specifically for weak but biologically relevant homologues of an input query. The ProFAT server has been evaluated using the complete set of proteins from three different domain families, including their weak relatives, and correctly identified between 90% and 100% of all domain family members studied in this context. ProFAT has furthermore been applied to a variety of proteins from different cellular contexts, and we provide evidence on how ProFAT can help in the functional prediction of proteins based on remotely conserved proteins.
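The keyword-mining step described above can be illustrated with a minimal sketch. It is not ProFAT's implementation: the `Hit` class, the function names, the example records and the plain case-insensitive substring matching are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Hit:
    """A database hit with its associated free text (hypothetical structure)."""
    accession: str                  # placeholder identifier, not a real accession
    evalue: float
    texts: List[str] = field(default_factory=list)  # annotations, features, abstracts


def matched_keywords(hit, keywords):
    """Return the keywords that occur anywhere in the hit's associated text."""
    blob = " ".join(hit.texts).lower()
    return sorted(k for k in keywords if k.lower() in blob)


def split_by_keywords(hits, keywords):
    """Separate hits into keyword-positive and keyword-negative lists."""
    positive, negative = [], []
    for hit in hits:
        (positive if matched_keywords(hit, keywords) else negative).append(hit)
    return positive, negative


# Toy data: a weak hit with a relevant abstract and a strong hit without one.
hits = [
    Hit("hit_A", 0.5, ["Binds membranes during endocytosis."]),
    Hit("hit_B", 1e-30, ["Ribosomal protein."]),
]
positive, negative = split_by_keywords(hits, ["endocytosis", "Rab5"])
print([h.accession for h in positive])
```

The point of the sketch is that a hit with a weak E-value can still be retained when its associated text matches the user's keyword list.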

Conclusion: By employing sensitive database search programs as well as exploiting the functional information associated with database sequences, ProFAT can detect remote, but biologically relevant relationships between proteins and will assist researchers in the prediction of protein function based on remote homologies.

Figures

Figure 1
Workflow of a ProFAT analysis. (A) A protein sequence and a keyword list are required inputs for a ProFAT analysis. The first step carried out by ProFAT is a domain search (RPS-BLAST) against the CDD database from the NCBI. If no conserved domain is detected with RPS-BLAST, the user can proceed to domain prediction (A, right figure), which combines an RPS-BLAST search using relaxed parameters with a BLAST search and subsequent text-mining for the biological relevance of identified hits. Alternatively, the user can choose to split the sequence into fragments of between 150 and 300 amino acids for further processing. Selected conserved domains and/or regions of the input query can then be submitted to the Annotation Engine and/or Threading Engine. The Annotation Engine combines a PSI-BLAST search with text-mining of the Gene Ontology annotation, features and PubMed abstracts associated with identified hits, thereby extracting hits involved in the process/function described by the user's keyword list. The Threading Engine combines a Threader 3.5 run with text-mining of the PDB keywords, features, compound information and PubMed abstracts associated with identified structures, for post-filtering using keywords from the user-provided keyword list. (B) HMMerThread pipeline. HMMerThread combines an HMMer search against the PFAM database of conserved domains with a Threader run. The input query is first sent to an HMMer search, whereby only domains with an associated 3D structure are chosen for further processing. Selected domains are then sent to Threader 3.5, with prior secondary structure prediction (PSI-PRED), coiled-coil prediction (COILS2) and low-complexity filtering (SEG), which are all performed on the entire input sequence to achieve higher accuracy. HMMerThread can therefore give a highly accurate prediction of conserved domains.
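The caption does not say how the sequence is split into fragments of 150 to 300 amino acids. One possible scheme, shown here purely as an assumption, cuts the sequence into the smallest number of near-equal, non-overlapping pieces that respect the upper bound; the function name and the 1-based coordinate convention are hypothetical.

```python
import math


def split_into_fragments(sequence, min_len=150, max_len=300):
    """Split a sequence into near-equal, non-overlapping fragments.

    Returns (start, end, fragment) tuples with 1-based inclusive coordinates.
    Sequences no longer than max_len are returned whole, even if shorter
    than min_len. Assumes max_len >= 2 * min_len so that every piece of a
    longer sequence falls within [min_len, max_len].
    """
    if max_len < 2 * min_len:
        raise ValueError("max_len must be at least twice min_len")
    n = len(sequence)
    if n == 0:
        return []
    if n <= max_len:
        return [(1, n, sequence)]
    pieces = math.ceil(n / max_len)       # fewest pieces that respect max_len
    base, extra = divmod(n, pieces)       # the first `extra` pieces get one more residue
    fragments, start = [], 0
    for i in range(pieces):
        size = base + (1 if i < extra else 0)
        fragments.append((start + 1, start + size, sequence[start:start + size]))
        start += size
    return fragments


# A 700-residue toy sequence is cut into three fragments.
for begin, end, frag in split_into_fragments("A" * 700):
    print(begin, end, len(frag))
```

Overlapping windows would be an equally plausible design; the non-overlapping version is chosen here only because it is the simplest to verify.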
Figure 2
Domain search and domain prediction using ProFAT. (A) Results for Dip13α/APPL1 [GenBank:NP_036228] from a ProFAT domain search. RPS-BLAST identified a PH-domain and a PTB-domain in the input query; the N-terminal region does not contain any conserved domains at the chosen E-value cutoff (E ≤ 1E-04). The upper window gives the user a description of the domain as found in CDD when the mouse is moved over the domain box. The image represents the sequence with identified conserved domains. The table at the bottom lists the identified domains with their amino acid boundaries in the input sequence. By either activating the checkboxes or clicking on the region and/or conserved domain in the image, the user can select conserved domains/regions for further processing by the Annotation Engine and/or Threading Engine (selectable by a checkbox and activated by pressing the 'Submit' box). In this case, the N-terminal region from amino acid 1 to 280 was selected for further processing. (B) If no domain was identified, the user can perform Domain Prediction. In this case, an RPS-BLAST search with an E-value cutoff of 100 is used to identify weak domain hits. The consensus sequences of these domains are, in turn, submitted to a regular BLAST search with subsequent text-mining for keywords occurring in the user-provided keyword list. In the case of the Dip13α/APPL1 N-terminal domain, RPS-BLAST finds SMC-domains, Biopterin_H, as well as a COG-domain. The identified domains can be submitted to the Annotation Engine and Threading Engine for a more detailed analysis (link 'Send to ProFAT').
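The effect of the two E-value cutoffs mentioned in the caption (1E-04 for the domain search, 100 for domain prediction) can be sketched as a filter over domain hits, followed by a scan for regions left without any domain. The dictionary layout, the function names and all coordinates below are invented for illustration and are not taken from the ProFAT output.

```python
def significant_domains(hits, cutoff=1e-4):
    """Keep domain hits whose E-value is at or below the cutoff."""
    return [h for h in hits if h["evalue"] <= cutoff]


def uncovered_regions(seq_len, domains, min_len=1):
    """Return 1-based inclusive (start, end) regions not covered by any domain."""
    regions, pos = [], 1
    for d in sorted(domains, key=lambda d: d["start"]):
        if d["start"] - pos >= min_len:
            regions.append((pos, d["start"] - 1))
        pos = max(pos, d["end"] + 1)
    if seq_len - pos + 1 >= min_len:
        regions.append((pos, seq_len))
    return regions


# Toy hits on a 700-residue query (names and coordinates are made up).
toy_hits = [
    {"name": "dom_strong_1", "start": 281, "end": 380, "evalue": 1e-20},
    {"name": "dom_strong_2", "start": 500, "end": 630, "evalue": 1e-10},
    {"name": "dom_weak", "start": 50, "end": 120, "evalue": 5.0},
]

strict = significant_domains(toy_hits)               # domain search cutoff
relaxed = significant_domains(toy_hits, cutoff=100)  # domain prediction cutoff
print(uncovered_regions(700, strict))
print(uncovered_regions(700, relaxed))
```

With the strict cutoff the weak hit is dropped and its region appears as unannotated; with the relaxed cutoff it is reported as a candidate domain.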
Figure 3
Results from the Core Modules of ProFAT. (A) Graphical and tabular representation of results from the Annotation Engine and Threading Engine (Dip13α/APPL1 [GenBank:NP_036228] was used as a query). Red bars in the image represent identified database sequences that contain one or more keywords from the user-provided list in their annotation; blue bars represent sequences in which no keywords were detected. The upper bars show results from the Annotation Engine, the lower bars those from the Threading Engine. The table below the image gives the user the number of hits with and without keywords, links to the raw results, tabular information on the frequency of observed GO-terms, as well as the starting and ending residue of the region and conserved domains in the input query. The numbers in the column 'Keyword Hits' link to the annotated alignments of keyword-positive database entries. Moving the mouse over the respective number changes the format of the graph to the image seen in (B), in which alignments are represented by narrow lines. The number in 'Total Hits' links to the complete PSI-BLAST output, in which each alignment is annotated with the associated information of the database hit. (B) Graphical output of the region 1–280 of the input query from the Annotation Engine. (C) Representative alignment of one of the identified hits that shows biological relevance in addition to sequence similarity. Each sequence that has been identified by PSI-BLAST is annotated with associated GenBank features, PubMed abstracts and Gene Ontology terms, as well as its sequence. Associated information can be individually viewed by clicking on the '+' sign next to the respective information.
Figure 4
Typical results from ProFAT's Threading Engine. (A) Threader alignment of the PH-domain of Dip13α/APPL1 [GenBank:NP_036228]. The Threading Engine picked up the crystal structure of the PH-domain of the protein Tiam1 ([PDB:1FOE]). Secondary structure elements are shown above the identified hit. The CATH ID, the threading score, as well as the PDB-ID are given underneath the alignment. The features, abstracts of associated publications, PDB compound information and the PDB-keywords can be individually visualized by the user. In this case, the abstract of the associated paper of 1FOE, as well as the PDB-keywords are shown. (B) Processed results of the Threader-output. In this case, the top five hits are shown, including their score, function, compound and keyword information.
Figure 5
Typical output of an HMMerThread run. (A) Results from the HMMer search against the PFAM conserved domain database. The input query was Dip13α/APPL1 [GenBank:NP_036228]. In addition to the PH- and PTB-domains, HMMer identified five potential conserved domains in the N-terminus of the protein sequence. Of these five predicted domains, the BAR domain has the lowest E-value (0.8) and was selected for further processing. (B) Results from the threading run identified the BAR domain from residues 4–224. By clicking on the orange bar, the user gets to the detailed results from the threading run (see C). The BAR domain can also be sent to the Annotation Engine and Threading Engine (link 'Send to ProFAT'). (C) Results from the threading run with the predicted N-terminal BAR domain of APPL1. With nearly 90% confidence, Threader identified the structures of Amphiphysin ([PDB:1URU]) and Arfaptin2 ([PDB:1I49]), both of which are members of the BAR domain family.
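The selection step in panel (A) can be sketched as picking, among predicted domains that have an associated 3D structure, the one with the lowest E-value. This reading of the selection rule is an assumption based on the captions; the function name, field names and toy values are hypothetical.

```python
def select_for_threading(pfam_hits):
    """Return the structure-backed domain hit with the lowest E-value, or None."""
    with_structure = [h for h in pfam_hits if h["has_structure"]]
    if not with_structure:
        return None
    return min(with_structure, key=lambda h: h["evalue"])


# Toy predictions; names and E-values are placeholders, not HMMer output.
toy_predictions = [
    {"name": "dom_a", "evalue": 0.3, "has_structure": False},
    {"name": "dom_b", "evalue": 0.8, "has_structure": True},
    {"name": "dom_c", "evalue": 2.5, "has_structure": True},
]
chosen = select_for_threading(toy_predictions)
print(chosen["name"])
```

In the toy data, `dom_a` has the lowest E-value overall but is skipped because no structure is associated with it.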
Figure 6
Evaluation of ProFAT using the domain families PABP, PLAT and HNF-1α. (A) Positive identification of PABP, PLAT and HNF-1α domain family members using HMMerThread and the Annotation Engine. Based on the Superfamily database [20], all members of the PABP, PLAT and HNF-1α families were subjected to high-throughput HMMerThread and Annotation Engine searches. Results show the percentage of family members positively identified by these two pipelines, as well as by the domain search programs HMMer and RPS-BLAST. (B) Keyword-positive hits of PABP, PLAT and HNF-1α domain family members in ProFAT's Annotation Engine. Results show the frequency of keywords identified within the different keyword lists used. Abbreviations used in (A): AE: Annotation Engine.
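The percentage reported in panel (A) is, for each method, the share of known family members that the method recovered. A small sketch of that arithmetic, with invented member identifiers and invented detection sets (no values here come from the evaluation itself):

```python
def percent_identified(detected_by_method, family_members):
    """Percentage of known family members recovered by each method."""
    members = set(family_members)
    if not members:
        raise ValueError("family_members must not be empty")
    return {
        method: 100.0 * len(set(found) & members) / len(members)
        for method, found in detected_by_method.items()
    }


# Toy family of four members and two hypothetical methods.
family = ["m1", "m2", "m3", "m4"]
detected = {
    "method_1": ["m1", "m2"],
    "method_2": ["m1", "m2", "m3", "m4", "unrelated"],
}
print(percent_identified(detected, family))
```

Hits outside the family (such as `unrelated`) do not count towards the percentage, so the measure reflects recall of family members only.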
Figure 7
Multiple sequence alignments of weakly conserved domains. (A) Multiple sequence alignment of CH domain family members with human Hook1, Hook2 and Hook3, as well as KPL2 from human, sea urchin and C. reinhardtii. Conserved residues are highlighted in yellow, essential residues are marked with an asterisk. (B) Multiple sequence alignment of Eps8 family members with representatives of the SAM domain family, with conserved residues highlighted in yellow. (C) Multiple sequence alignment of representatives of the RRM domain family with members of the PARN family, as well as human unknown protein LOC84060 and its orthologues from zebrafish and fly. Conserved residues are highlighted in yellow. (D) Multiple sequence alignment of human, fly and worm orthologues of unknown protein LOC79969 with representatives of the acetyltransf_1 domain family. Conserved residues are highlighted in yellow, the catalytically important tyrosine is marked with an asterisk. For accession numbers of all proteins shown in alignments A–D, see [Additional file 10].

References

    1. Ivanov D, Schleiffer A, Eisenhaber F, Mechtler K, Haering CH, Nasmyth K. Eco1 is a novel acetyltransferase that can acetylate proteins involved in cohesion. Curr Biol. 2002;12:323–328. doi: 10.1016/S0960-9822(02)00681-4.
    2. Rea S, Eisenhaber F, O'Carroll D, Strahl BD, Sun ZW, Schmid M, Opravil S, Mechtler K, Ponting CP, Allis CD, Jenuwein T. Regulation of chromatin structure by site-specific histone H3 methyltransferases. Nature. 2000;406:593–599. doi: 10.1038/35020506.
    3. Miaczynska M, Christoforidis S, Giner A, Shevchenko A, Uttenweiler-Joseph S, Habermann B, Wilm M, Parton RG, Zerial M. APPL proteins link Rab5 to nuclear signal transduction via an endosomal compartment. Cell. 2004;116:445–456. doi: 10.1016/S0092-8674(04)00117-5.
    4. Uhlmann F, Wernic D, Poupart MA, Koonin EV, Nasmyth K. Cleavage of cohesin by the CD clan protease separin triggers anaphase in yeast. Cell. 2000;103:375–386. doi: 10.1016/S0092-8674(00)00130-6.
    5. MacCallum RM, Kelley LA, Sternberg MJ. SAWTED: structure assignment with text description--enhanced detection of remote homologues with automated SWISS-PROT annotation comparisons. Bioinformatics. 2000;16:125–129. doi: 10.1093/bioinformatics/16.2.125.