BMC Bioinformatics. 2006 Oct 23;7:466.

doi: 10.1186/1471-2105-7-466.

ProFAT: a web-based tool for the functional annotation of protein sequences

Charles Richard Bradshaw¹, Vineeth Surendranath, Bianca Habermann

PMID: 17059594
PMCID: PMC1636073
DOI: 10.1186/1471-2105-7-466

Charles Richard Bradshaw et al. BMC Bioinformatics. 2006.

Affiliation

¹ Max Planck Institute of Molecular Cell Biology and Genetics, Pfotenhauerstrasse 108, 01307 Dresden, Germany. bradshaw@mpi-cbg.de

Abstract

Background: The functional annotation of proteins relies on published information concerning their close and remote homologues in sequence databases. Evidence for remote sequence similarity can be further strengthened if the query sequence and the identified database sequences share a similar biological background. However, few tools so far provide a means to include functional information in sequence database searches.

Results: We present ProFAT, a web-based tool for the functional annotation of protein sequences based on remote sequence similarity. ProFAT combines sensitive sequence database search methods and a fold recognition algorithm with a simple text-mining approach. ProFAT extracts identified hits based on their biological background by keyword-mining of the annotations, features and, most importantly, literature associated with a sequence entry. A user-provided keyword list enables the user to search specifically for weak but biologically relevant homologues of an input query. The ProFAT server has been evaluated using the complete set of proteins from three different domain families, including their weak relatives, and correctly identified between 90% and 100% of all domain family members studied in this context. ProFAT has furthermore been applied to a variety of proteins from different cellular contexts, and we provide evidence of how ProFAT can help in the functional prediction of proteins based on remotely conserved proteins.
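The keyword-mining step described above can be illustrated with a minimal Python sketch. It is not ProFAT's implementation: the `Hit` record, the function names and the whole-word, case-insensitive matching rule are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Hit:
    """Hypothetical record for one database hit and its associated text."""
    accession: str
    annotation: str = ""
    features: str = ""
    abstracts: list[str] = field(default_factory=list)


def matched_keywords(hit: Hit, keywords: list[str]) -> set[str]:
    """Return the user keywords found anywhere in the hit's associated text."""
    text = " ".join([hit.annotation, hit.features, *hit.abstracts]).lower()
    found = set()
    for kw in keywords:
        # Whole-word, case-insensitive match; an assumption, not ProFAT's rule.
        if re.search(r"\b" + re.escape(kw.lower()) + r"\b", text):
            found.add(kw)
    return found


def split_hits(hits: list[Hit], keywords: list[str]):
    """Partition hits into keyword-positive and keyword-negative lists."""
    positive, negative = [], []
    for hit in hits:
        (positive if matched_keywords(hit, keywords) else negative).append(hit)
    return positive, negative
```

With whole-word matching, a keyword such as "endosome" would not match "endosomal", so in this sketch the keyword list would need to include the word forms the user cares about.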

Conclusion: By employing sensitive database search programs as well as exploiting the functional information associated with database sequences, ProFAT can detect remote, but biologically relevant relationships between proteins and will assist researchers in the prediction of protein function based on remote homologies.

Figures

**Figure 1**
**Workflow of a ProFAT Analysis**. **(A)** A protein sequence and a keyword list are required inputs for a ProFAT analysis. The first step carried out by ProFAT is a domain search (RPS-BLAST) against the CDD database from the NCBI. If no conserved domain is detected with RPS-BLAST, the user can proceed to domain prediction (A, right panel), which combines an RPS-BLAST search using relaxed parameters with a BLAST search and subsequent text-mining for the biological relevance of identified hits. Alternatively, the user can choose to split the sequence into fragments of between 150 and 300 amino acids for further processing. Selected conserved domains and/or regions of the input query can then be submitted to the *Annotation Engine* and/or *Threading Engine*. The *Annotation Engine* combines a PSI-BLAST search with text-mining of Gene Ontology annotation, features and PubMed abstracts associated with identified hits, thereby extracting hits involved in the process/function described by the user's keyword list. The *Threading Engine* combines a Threader 3.5 run with text-mining of associated PDB keywords, features, compound information and PubMed abstracts of identified structures for post-filtering using keywords from the user-provided keyword list. **(B)** *HMMerThread* pipeline. *HMMerThread* combines an HMMer search against the PFAM database of conserved domains with a Threader run. The input query is first sent to an HMMer search, whereby only domains with an associated 3D structure are chosen for further processing. Selected domains are then sent to Threader 3.5, with prior secondary structure prediction (PSI-PRED), coiled-coil prediction (COILS2) and low-complexity filtering (SEG), which are all performed on the entire input sequence to achieve higher accuracy. *HMMerThread* can therefore give a highly accurate prediction of conserved domains.
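The caption does not say how ProFAT chooses fragment boundaries when a sequence is split. As a rough sketch under that caveat, the Python function below (a hypothetical helper, not part of ProFAT) cuts a sequence into near-equal, non-overlapping pieces; for any sequence longer than 300 residues, every piece falls within the 150 to 300 residue range quoted above.

```python
import math


def split_into_fragments(sequence: str, max_len: int = 300):
    """Split a sequence into near-equal, non-overlapping fragments.

    Returns (start, end, fragment) tuples with 1-based inclusive coordinates.
    Sequences of at most max_len residues are returned as a single fragment.
    """
    n = len(sequence)
    if n == 0:
        return []
    k = math.ceil(n / max_len)      # smallest number of pieces that fit max_len
    base, extra = divmod(n, k)      # the first `extra` pieces get one more residue
    fragments = []
    pos = 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        fragments.append((pos + 1, pos + size, sequence[pos:pos + size]))
        pos += size
    return fragments
```

Using the smallest number of pieces keeps fragments as long as possible, which is a design choice of this sketch; overlapping windows would be an equally plausible alternative.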

**Figure 2**
**Domain search and domain prediction using ProFAT**. **(A)** Results for Dip13α/APPL1 [GenBank:NP_036228] from a ProFAT domain search. RPS-BLAST identified a PH domain and a PTB domain in the input query; the N-terminal region does not contain any conserved domains at the chosen E-value cutoff (E <= 1E-04). Mousing over a domain box displays the description of the domain, as found in CDD, in the upper window. The image represents the sequence with identified conserved domains. The table at the bottom lists the identified domains with their amino acid boundaries in the input sequence. By either activating the checkboxes or clicking on the region and/or conserved domain in the image, the user can select conserved domains/regions for further processing by the *Annotation Engine* and/or *Threading Engine* (selectable by a checkbox and activated by pressing the 'Submit' box). In this case, the N-terminal region from amino acid 1 to 280 was selected for further processing. **(B)** If no domain was identified, the user can perform *Domain Prediction*. In this case, an RPS-BLAST search with an E-value cutoff of 100 is used to identify weak domain hits. The consensus sequences of these domains are, in turn, submitted to a regular BLAST search with subsequent text-mining for keywords occurring in the user-provided keyword list. In the case of the Dip13α/APPL1 N-terminal domain, RPS-BLAST finds SMC domains, Biopterin_H, as well as a COG domain. The identified domains can be submitted to the *Annotation Engine* and *Threading Engine* for a more detailed analysis (link 'Send to ProFAT').
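The two cutoffs quoted in this caption (E <= 1E-04 for the domain search, E <= 100 for *Domain Prediction*) suggest a simple two-tier selection, sketched below in Python. The `DomainHit` type, the function name and the returned stage labels are hypothetical, and in ProFAT itself the second stage is started by the user rather than automatically.

```python
from typing import NamedTuple


class DomainHit(NamedTuple):
    """Hypothetical container for one parsed RPS-BLAST domain hit."""
    name: str
    start: int
    end: int
    evalue: float


STRICT_EVALUE = 1e-4    # cutoff quoted in the caption for the domain search
RELAXED_EVALUE = 100.0  # cutoff quoted in the caption for Domain Prediction


def select_domains(hits: list[DomainHit]):
    """Return (stage, hits): strict hits if any exist, else relaxed hits."""
    strict = [h for h in hits if h.evalue <= STRICT_EVALUE]
    if strict:
        return "domain_search", strict
    # No confident domain found: fall back to weak hits for later text-mining.
    relaxed = [h for h in hits if h.evalue <= RELAXED_EVALUE]
    return "domain_prediction", relaxed
```

Hits returned by the relaxed stage are only candidates; in the workflow described here they would still need to pass the BLAST search and keyword filtering before being trusted.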

**Figure 3**
**Results from the Core Modules of ProFAT**. **(A)** Graphical and tabular representation of results from the *Annotation Engine* and *Threading Engine* (Dip13α/APPL1 [GenBank:NP_036228] was used as a query). Red bars in the image represent identified database sequences that contain one or more keywords from the user-provided list in their annotation; blue bars represent sequences in which no keywords were detected. The upper bars show results from the *Annotation Engine*; the lower bars show those from the *Threading Engine*. The table below the image gives the user the number of hits with and without keywords, links to the raw results, tabular information on the frequency of observed GO terms, as well as the starting and ending residue of the region and conserved domains in the input query. The numbers in the column 'Keyword Hits' link to the annotated alignments of keyword-positive database entries. Moving the mouse over the respective number changes the format of the graph to the image seen in (B), whereby alignments are represented by narrow lines. The number in 'Total Hits' links to the complete PSI-BLAST output, whereby each alignment is annotated with the associated information of the database hit. **(B)** Graphical output of the region 1 – 280 of the input query from the *Annotation Engine*. **(C)** Representative alignment of one of the identified hits that shows biological relevance in addition to sequence similarity. Each sequence that has been identified by PSI-BLAST is annotated with associated GenBank features, PubMed abstracts and Gene Ontology terms, as well as its sequence. Associated information can be individually viewed by clicking on the '+' sign next to the respective information.

**Figure 4**
**Typical results from ProFAT's *Threading Engine***. **(A)** Threader alignment of the PH-domain of Dip13α/APPL1 [GenBank:NP_036228]. The *Threading Engine* picked up the crystal structure of the PH-domain of the protein Tiam1 ([PDB:1FOE]). Secondary structure elements are shown above the identified hit. The CATH ID, the threading score, as well as the PDB-ID are given underneath the alignment. The features, abstracts of associated publications, PDB compound information and the PDB-keywords can be individually visualized by the user. In this case, the abstract of the associated paper of 1FOE, as well as the PDB-keywords are shown. **(B)** Processed results of the Threader-output. In this case, the top five hits are shown, including their score, function, compound and keyword information.

**Figure 5**
**Typical Output of an *HMMerThread* run**. **(A)** Results from the HMMer search against the PFAM conserved domain database. The input query was Dip13α/APPL1 [GenBank:NP_036228]. In addition to the PH and PTB domains, HMMer identified five potential conserved domains in the N-terminus of the protein sequence. Of these five predicted domains, the BAR domain has the lowest E-value (0.8) and was selected for further processing. **(B)** Results from the threading run, which identified the BAR domain at residues 4–224. Clicking on the orange bar takes the user to the detailed results from the threading run (see C). The BAR domain can also be sent to the *Annotation Engine* and *Threading Engine* (link 'Send to ProFAT'). **(C)** Results from the threading run with the predicted N-terminal BAR domain of APPL1. Threader identified, with nearly 90% confidence, the structures of Amphiphysin ([PDB:1URU]) and Arfaptin2 ([PDB:1I49]), which are both members of the BAR domain family.
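The selection step in panel (A), together with the rule from Figure 1 that only domains with an associated 3D structure are processed further, can be sketched in Python as follows. The `PfamHit` record and the function name are hypothetical, and picking a single lowest-E-value domain is an assumption based on this one example, not a documented rule of *HMMerThread*.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PfamHit:
    """Hypothetical record for one HMMer hit against PFAM."""
    name: str
    evalue: float
    has_structure: bool  # whether a 3D structure is associated with the domain


def pick_domain_for_threading(hits: list[PfamHit]) -> Optional[PfamHit]:
    """Return the structure-backed hit with the lowest E-value, if any."""
    candidates = [h for h in hits if h.has_structure]
    return min(candidates, key=lambda h: h.evalue, default=None)
```

For a list containing a BAR hit with an E-value of 0.8 and weaker structure-backed hits, the function returns the BAR hit; it returns `None` when no hit has an associated structure.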

**Figure 6**
**Evaluation of ProFAT using the domain families PABP, PLAT and HNF-1α**. **(A)** Positive identification of PABP, PLAT and HNF-1α domain family members using *HMMerThread* and the *Annotation Engine*. Based on the Superfamily database [20], all members of the PABP, PLAT and HNF-1α families were subjected to high-throughput *HMMerThread* and *Annotation Engine* searches. Results show the percentage of family members positively identified by these two pipelines, as well as by the domain search programs HMMer and RPS-BLAST. **(B)** Keyword-positive hits of PABP, PLAT and HNF-1α domain family members in ProFAT's *Annotation Engine*. Results show the frequency of keywords identified within the different keyword lists used. Abbreviations used in (A): AE: *Annotation Engine*.

**Figure 7**
**Multiple sequence alignments of weakly conserved domains**. **(A)** Multiple sequence alignment of CH domain family members with human Hook1, Hook2 and Hook3, as well as KPL2 from human, sea urchin and *C. reinhardtii*. Conserved residues are highlighted in yellow; essential residues are marked with an asterisk. **(B)** Multiple sequence alignment of Eps8 family members with representatives of the SAM domain family, with conserved residues highlighted in yellow. **(C)** Multiple sequence alignment of representatives of the RRM domain family with members of the PARN family, as well as human unknown protein LOC84060 and its orthologues from zebrafish and fly. Conserved residues are highlighted in yellow. **(D)** Multiple sequence alignment of human, fly and worm orthologues of unknown protein LOC79969 with representatives of the acetyltransf_1 domain family. Conserved residues are highlighted in yellow; the catalytically important tyrosine is marked with an asterisk. For accession numbers of all proteins shown in alignments A – D, see [Additional file 10].

Cited by

Cold-Induced Reprogramming of Subcutaneous White Adipose Tissue Assessed by Single-Cell and Single-Nucleus RNA Sequencing.
Liu Q, Long Q, Zhao J, Wu W, Lin Z, Sun W, Gu P, Deng T, Loomes KM, Wu D, Kong APS, Zhou J, Cheng AS, Hui HX. Research (Wash D C). 2023 Jun 28;6:0182. doi: 10.34133/research.0182. eCollection 2023. PMID: 37398933. Free PMC article.
Improving classification in protein structure databases using text mining.
Koussounadis A, Redfern OC, Jones DT. BMC Bioinformatics. 2009 May 5;10:129. doi: 10.1186/1471-2105-10-129. PMID: 19416501. Free PMC article.
HMMerThread: detecting remote, functional conserved domains in entire genomes by combining relaxed sequence-database searches with fold recognition.
Bradshaw CR, Surendranath V, Henschel R, Mueller MS, Habermann BH. PLoS One. 2011 Mar 10;6(3):e17568. doi: 10.1371/journal.pone.0017568. PMID: 21423752. Free PMC article.

References

1. Ivanov D, Schleiffer A, Eisenhaber F, Mechtler K, Haering CH, Nasmyth K. Eco1 is a novel acetyltransferase that can acetylate proteins involved in cohesion. Curr Biol. 2002;12:323–328. doi: 10.1016/S0960-9822(02)00681-4.
1. Rea S, Eisenhaber F, O'Carroll D, Strahl BD, Sun ZW, Schmid M, Opravil S, Mechtler K, Ponting CP, Allis CD, Jenuwein T. Regulation of chromatin structure by site-specific histone H3 methyltransferases. Nature. 2000;406:593–599. doi: 10.1038/35020506.
1. Miaczynska M, Christoforidis S, Giner A, Shevchenko A, Uttenweiler-Joseph S, Habermann B, Wilm M, Parton RG, Zerial M. APPL proteins link Rab5 to nuclear signal transduction via an endosomal compartment. Cell. 2004;116:445–456. doi: 10.1016/S0092-8674(04)00117-5.
1. Uhlmann F, Wernic D, Poupart MA, Koonin EV, Nasmyth K. Cleavage of cohesin by the CD clan protease separin triggers anaphase in yeast. Cell. 2000;103:375–386. doi: 10.1016/S0092-8674(00)00130-6.
1. MacCallum RM, Kelley LA, Sternberg MJ. SAWTED: structure assignment with text description--enhanced detection of remote homologues with automated SWISS-PROT annotation comparisons. Bioinformatics. 2000;16:125–129. doi: 10.1093/bioinformatics/16.2.125.
