BMC Bioinformatics. 2012 Feb 15:13:33.
doi: 10.1186/1471-2105-13-33.

BLANNOTATOR: enhanced homology-based function prediction of bacterial proteins

Matti Kankainen et al. BMC Bioinformatics.

Abstract

Background: Automated function prediction has played a central role in determining the biological functions of bacterial proteins. Typically, protein function annotation relies on homology, and function is inferred from other proteins with similar sequences. This approach has become popular in bacterial genomics because it is one of the few methods that is practical for large datasets and because it does not require additional functional genomics experiments. However, the existing solutions produce erroneous predictions in many cases, especially when query sequences have low levels of identity with the annotated source protein. This problem has created a pressing need for improvements in homology-based annotation.

Results: We present an automated method for the functional annotation of bacterial protein sequences. Based on sequence similarity searches, BLANNOTATOR accurately annotates query sequences with one-line summary descriptions of protein function. It groups sequences identified by BLAST into subsets according to their annotation and bases its prediction on a set of sequences with consistent functional information. We show the results of BLANNOTATOR's performance in sets of bacterial proteins with known functions. We simulated the annotation process for 3090 SWISS-PROT proteins using a database in its state preceding the functional characterisation of the query protein. For this dataset, our method outperformed the five others that we tested, and the improved performance was maintained even in the absence of highly related sequence hits. We further demonstrate the value of our tool by analysing the putative proteome of Lactobacillus crispatus strain ST1.

Conclusions: BLANNOTATOR is an accurate method for bacterial protein function prediction. It is practical for genome-scale data and does not require pre-existing sequence clustering; thus, this method suits the needs of bacterial genome and metagenome researchers. The method and a web-server are available at http://ekhidna.biocenter.helsinki.fi/poxo/blannotator/.

Figures

Figure 1
Schematic representation of the construction of the SWISS-PROT dataset used to assess the performance of automated protein function assignment methods. Test proteins (dark grey boxes) were initially selected by extraction from the entire SWISS-PROT database. The extraction protocol involved the removal of non-bacterial entries, the removal of entries created before 2005, the removal of entries containing the words 'UPF' or 'uncharacterized', the selection of entries that had been added directly to SWISS-PROT or had undergone revision since their storage in TrEMBL, the removal of similarly annotated entries and the removal of entries showing sequence similarity to each other. The construction of the sequence similarity search results (light grey boxes) for functional inference included a sequence comparison against UniProt with BLAST, the removal of BLAST hits to sequences whose creation date was the same as or later than the annotation date of the query sequence and the restoration of the annotations of the remaining BLAST hits to their status just before the annotation date. Barrels show the number of entries and BLAST hits that passed each filtering step, and the intensity of the red colour indicates the corresponding fractions. White boxes in the crossing area show the annotation (DE) and the annotation date (DT) for two test sequences. Red crosses indicate BLAST hits that were removed.
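The date-based removal of BLAST hits described in this caption can be illustrated with a minimal sketch. The `Hit` record, its field names, the accessions and the dates are all hypothetical and serve only to show the comparison; the restoration of older annotation versions is not covered here.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Hit:
    accession: str   # hypothetical identifier of the matched entry
    created: date    # creation date of the matched entry
    bit_score: float


def drop_later_hits(hits, query_annotation_date):
    """Keep only hits created strictly before the query's annotation date.

    Hits created on the same day as, or later than, the annotation date
    are discarded, mirroring the rule stated in the caption.
    """
    return [h for h in hits if h.created < query_annotation_date]


# Made-up example data.
hits = [
    Hit("A0001", date(2003, 5, 1), 310.0),
    Hit("A0002", date(2006, 3, 7), 295.0),   # same day as the annotation: removed
    Hit("A0003", date(2007, 1, 15), 120.0),  # later than the annotation: removed
]
kept = drop_later_hits(hits, date(2006, 3, 7))
print([h.accession for h in kept])
```

Using a strict comparison is what removes hits whose creation date equals the annotation date, so a hit cannot inherit its description from the query's own annotation event.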
Figure 2
The proportion of correctly annotated BLAST hits in the SWISS-PROT dataset. The mean of proportions of BLAST hits with function descriptions similar to those of their query sequences was recorded before (grey bars) and after (brown bars) the removal of circular referencing. Statistics were computed using all GO terms (darker bars) and by accepting only GO terms with experimental or computational evidence codes (lighter bars). BLAST hits were selected based on DE annotation, GO annotation, GO or DE annotation, GO and DE annotation, and BLANNOTATOR. The reported means were calculated over all test protein sequences.
Figure 3
Mean annotation quality of the ideal and worst-case predictions in the SWISS-PROT dataset. Modified Levenshtein distance-based statistics are shown for the ideal (panels A and C) and worst-case (panels B and D) predictions after removing BLAST hits at various sequence identity and alignment coverage thresholds. Panels A and B show the statistics for a dataset from which circularly referenced annotations had been removed, and panels C and D show the statistics for a dataset in the presence of circularly referenced protein annotations. Red colours indicate bad predictions and blue colours good predictions.
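The caption refers to a modified Levenshtein distance, but the modification is not described in this record. As a point of reference only, the sketch below implements the standard, unmodified Levenshtein distance. Comparing descriptions word by word and dividing by the length of the longer description are assumptions made for illustration, not the authors' definition.

```python
def levenshtein(a, b):
    """Classic edit distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalised_distance(predicted, correct):
    """Word-level distance scaled to [0, 1]; 0 means identical descriptions.

    Lower-casing, whitespace tokenisation and the scaling are assumptions.
    """
    p, c = predicted.lower().split(), correct.lower().split()
    longest = max(len(p), len(c))
    return levenshtein(p, c) / longest if longest else 0.0


print(normalised_distance("Putative phospho-beta-glycosidase",
                          "Phospho-beta-glycosidase"))
```

Under these assumptions, a predicted description that differs from the correct one by a single extra word out of two scores 0.5, and an exact match scores 0.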
Figure 4
The quality of automated protein function predictions. The fraction of predictions below a certain modified Levenshtein distance from the correct annotation is shown. Function prediction was performed from an unfiltered BLAST match list and a list from which circularly referenced annotations had been removed. Function prediction was based on the most significant BLAST match (purple), the top BLAST match without any uninformative words (green), the most common annotation among BLAST hits (blues), the annotation associated with the highest bit score sum (yellow), a word-based scoring scheme (brown) and BLANNOTATOR (red). Dashed lines show the performance of the tool when applied to the largest group of matches sharing a common GO annotation. The black dashed lines and the grey background indicate the theoretical level of performance when the ideal, median or worst-case predictions were chosen.
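Four of the simpler selection rules named in this caption can be sketched as follows. Hits are represented as hypothetical `(description, bit_score)` pairs, the bit score stands in for match significance, and the `UNINFORMATIVE` word list is an assumption because the list actually used is not given here. The word-based scoring scheme and BLANNOTATOR itself are not reproduced.

```python
from collections import Counter, defaultdict

# Assumed list of uninformative words; not taken from the source.
UNINFORMATIVE = {"hypothetical", "putative", "uncharacterized", "predicted"}


def top_hit(hits):
    """Description of the highest-scoring hit."""
    return max(hits, key=lambda h: h[1])[0]


def top_informative_hit(hits):
    """Highest-scoring hit whose description has no uninformative word."""
    informative = [h for h in hits
                   if not UNINFORMATIVE & set(h[0].lower().split())]
    return top_hit(informative) if informative else None


def most_common(hits):
    """Description that occurs most often among the hits."""
    return Counter(d for d, _ in hits).most_common(1)[0][0]


def highest_score_sum(hits):
    """Description with the largest summed bit score."""
    totals = defaultdict(float)
    for description, score in hits:
        totals[description] += score
    return max(totals, key=totals.get)


# Made-up hit list for illustration.
hits = [
    ("Hypothetical protein", 400.0),
    ("Phospho-beta-glycosidase", 350.0),
    ("Phospho-beta-glycosidase", 340.0),
    ("Acyltransferase", 150.0),
    ("Acyltransferase", 140.0),
    ("Acyltransferase", 130.0),
]
for rule in (top_hit, top_informative_hit, most_common, highest_score_sum):
    print(rule.__name__, "->", rule(hits))
```

On this made-up list the rules disagree: the top hit is uninformative, the most frequent description is not the one with the largest score sum, which is the kind of disagreement the comparison in the figure evaluates.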
Figure 5
Outline of the BLANNOTATOR method. Related sequences (coloured bars) are detected using BLAST against UniProt, GO annotation information (grey circles) is extracted from GOA, sequence hits are organised into groups according to their DE and GO annotation and DE annotations are scored. In the example shown, 15 BLAST hits, described by four DE annotations (red, cyan, violet and dark blue bars) and two GO annotations (the grey circle diagrams), are split into two clusters. Initial BLAST bit scores, as well as the final and intermediate scores, are shown for the larger of the two clusters.
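The general idea of grouping hits by annotation before scoring descriptions can be sketched as below. This is not the BLANNOTATOR algorithm: the actual grouping and scoring rules are not given in this record, so grouping by identical GO term sets and ranking by summed bit score are stand-in assumptions. The GO labels are placeholders and the scores are invented.

```python
from collections import defaultdict


def group_by_go(hits):
    """Group (description, go_terms, bit_score) hits by identical GO term sets."""
    groups = defaultdict(list)
    for description, go_terms, score in hits:
        groups[frozenset(go_terms)].append((description, score))
    return groups


def predict(hits):
    """Pick the group with the largest total score, then its best description.

    Both steps use summed bit scores, which is an assumption made here.
    """
    groups = group_by_go(hits)
    best_group = max(groups.values(), key=lambda g: sum(s for _, s in g))
    totals = defaultdict(float)
    for description, score in best_group:
        totals[description] += score
    return max(totals, key=totals.get)


# Placeholder GO labels and invented scores, for illustration only.
HYDROLASE = {"GO:placeholder_hydrolase"}
TRANSFERASE = {"GO:placeholder_transferase"}
hits = [
    ("Phospho-beta-glycosidase", HYDROLASE, 350.0),
    ("Beta-glucosidase", HYDROLASE, 300.0),
    ("Phospho-beta-glycosidase", HYDROLASE, 280.0),
    ("1-acyl-sn-glycerol-3-phosphate acyltransferase", TRANSFERASE, 200.0),
]
print(predict(hits))
```

Restricting the description vote to one functionally consistent group keeps hits with a different function from diluting or overriding the prediction.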
Figure 6
A screenshot from the BLANNOTATOR web server showing the results page. The BLAST hits to the protein sequence LCRIS_00067 were assigned to two groups, reflecting the fact that matching proteins are involved in two molecular functions. The results indicate that the protein could be described as either 'Phospho-beta-glycosidase' or as '1-acyl-sn-glycerol-3-phosphate acyltransferase'. The first function describes proteins that have a hydrolase activity and act on glycosyl bonds, whereas the second function describes proteins that have transferase activity and transfer acyl groups other than amino-acyl groups. Our tool suggests that the protein of interest is more likely to have the hydrolase activity.
