Nucleic Acids Res. 2009 Jan;37(Database issue):D471-8.
doi: 10.1093/nar/gkn661. Epub 2008 Oct 11.

HAMAP: a database of completely sequenced microbial proteome sets and manually curated microbial protein families in UniProtKB/Swiss-Prot

Tania Lima et al. Nucleic Acids Res. 2009 Jan.

Abstract

The growth in the number of completely sequenced microbial genomes (bacterial and archaeal) has generated a need for a procedure that provides UniProtKB/Swiss-Prot-quality annotation to as many protein sequences as possible. We have devised a semi-automated system, HAMAP (High-quality Automated and Manual Annotation of microbial Proteomes), that uses manually built annotation templates for protein families to propagate annotation to all members of manually defined protein families, using very strict criteria. The HAMAP system is composed of two databases, the proteome database and the family database, and of an automatic annotation pipeline. The proteome database comprises biological and sequence information for each completely sequenced microbial proteome, and it offers several tools for CDS searches, BLAST options and retrieval of specific sets of proteins. The family database currently comprises more than 1500 manually curated protein families and their annotation templates that are used to annotate proteins that belong to one of the HAMAP families. On the HAMAP website, individual sequences as well as whole genomes can be scanned against all HAMAP families. The system provides warnings for the absence of conserved amino acid residues, unusual sequence length, etc. Thanks to the implementation of HAMAP, more than 200,000 microbial proteins have been fully annotated in UniProtKB/Swiss-Prot (HAMAP website: http://www.expasy.org/sprot/hamap).

Figures

Figure 1.
Example of a HAMAP protein family annotation template (family rule), MF_00074 (http://www.expasy.org/unirule/MF_00074). Annotation templates contain three sections: ‘General rule information’, ‘Propagated annotation’ and ‘Additional information’. General information comprises: the family identification number (MF_xxxxx), dates of creation and revision, the ‘Data class’, which indicates that the whole protein is annotated by the family rule and not only a specific domain, and ‘Predictors’, which contain the distribution of matches and the alignment that was used to generate the family profile. The ‘Propagated annotation’ section contains the information that is propagated to all members of a protein family, or only to some members if the field is preceded by ‘cases’ or ‘conditions’. For MF_00074, the function field differs depending on the taxonomic origin, but all proteins have ‘Cytoplasm’ as subcellular location and all belong to the family ‘RNA methyltransferase rsmG’. This section also contains cross-references to other protein family databases, such as Pfam and TIGRFAMs, and manually selected GO terms. Additional information includes: the size range of members of this family; any related protein families; the list of characterized protein(s) that were used to compile information for the creation of the protein family and its annotation template (for MF_00074, literature is found for the proteins of E. coli, Bacillus subtilis, Mycobacterium tuberculosis and Streptomyces coelicolor); the scope, i.e. the taxonomic groups covered by this family; whether in at least one member this protein is fused to another protein in either the N-terminal or C-terminal region; and whether there are duplicates or whether in some species the protein is encoded on a plasmid. In the ‘UniProtKB rule member sequences’ section, complete sets of member proteins can be retrieved, taxonomic distribution can be browsed, and specific sets of proteins can be retrieved.
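The size range recorded in a family rule is what makes a warning about unusual sequence length possible. The following is a minimal sketch of such a check, not the HAMAP implementation: the function name `check_length` and the example bounds are hypothetical, and the real system may apply different tolerances.

```python
from typing import Optional


def check_length(seq_length: int, min_len: int, max_len: int) -> Optional[str]:
    """Return a warning string if the length falls outside the family size range.

    min_len and max_len stand in for the size range stored in a family rule;
    the values used by any real rule are not reproduced here.
    """
    if seq_length < min_len:
        return f"sequence shorter than family size range ({seq_length} < {min_len})"
    if seq_length > max_len:
        return f"sequence longer than family size range ({seq_length} > {max_len})"
    return None  # length is within the expected range, no warning


# Hypothetical size range of 190-250 residues, chosen only for illustration.
print(check_length(207, 190, 250))
print(check_length(412, 190, 250))
```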
Figure 2.
Examples of uses of the conditional statements ‘case’ and ‘conditions’ in family annotation templates (family rules). MF_00112 (http://www.expasy.org/unirule/MF_00112): an example of ID/protein name/gene name propagation depending on taxonomic distinction. In archaea, no gene name has been assigned but enzyme function has been proven in several different species, whereas the gene name pcrB is used only in Bacillales, with a function that has only been suggested for B. subtilis. Note that the reaction catalyzed by the archaeal protein has no biological significance in bacteria, since GGGP is a specific precursor of archaeal membrane lipids. MF_01544 (http://www.expasy.org/unirule/MF_01544): Subcellular location is predicted based on the number of membranes the bacterium possesses. MF_01624 (http://www.expasy.org/unirule/MF_01624): an example of conditions used for active site and disulfide bond feature propagation. If the indicated amino acid(s) are not present in the appropriate position(s) in the sequence, the feature is not propagated and a warning is generated, necessitating manual intervention. MF_01339 (http://www.expasy.org/unirule/MF_01339): an example of active site, metal and modified residue feature propagation. In the last two examples, the template entry used to derive the information is also indicated.
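The condition-based feature propagation described for MF_01624 can be illustrated with a short sketch. It assumes that feature positions have already been mapped from the template entry onto the target sequence; `FeatureRule`, `propagate_features`, the feature keys and the example sequence are hypothetical and do not reproduce the HAMAP rule syntax.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class FeatureRule:
    """Hypothetical stand-in for one conditional feature line of a family rule."""
    key: str        # feature type, e.g. "ACT_SITE"
    position: int   # 1-based position in the target sequence (assumed already mapped)
    expected: str   # residue that must be present for the feature to be propagated


def propagate_features(
    sequence: str, rules: List[FeatureRule]
) -> Tuple[List[Tuple[str, int]], List[str]]:
    """Propagate each feature only if the expected residue is present.

    Returns the propagated features and a list of warnings for the ones
    that were withheld.
    """
    features: List[Tuple[str, int]] = []
    warnings: List[str] = []
    for rule in rules:
        idx = rule.position - 1
        residue = sequence[idx] if 0 <= idx < len(sequence) else None
        if residue == rule.expected:
            features.append((rule.key, rule.position))
        else:
            found = residue if residue is not None else "no residue (out of range)"
            warnings.append(
                f"{rule.key} at {rule.position}: expected {rule.expected}, found {found}"
            )
    return features, warnings


# Toy sequence and rules, invented for illustration only.
features, warnings = propagate_features(
    "MKCAH",
    [FeatureRule("ACT_SITE", 3, "C"), FeatureRule("DISULFID", 5, "C")],
)
print(features)
print(warnings)
```

In this sketch a withheld feature is reported rather than silently dropped, which mirrors the caption's point that a missing residue leads to a warning and manual intervention.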
Figure 3.
The HAMAP annotation pipeline. UniProtKB/TrEMBL complete proteome entries that match a HAMAP family detection profile are passed through a ‘template engine’. Each profile is derived from an alignment of manually selected family members, and matches to the profiles are stored in a ‘Match database’, which allows family membership to be assigned. The template engine applies the annotation found in the corresponding HAMAP annotation template, resolving its conditional statements, to generate UniProtKB/Swiss-Prot annotation. If the system generates warnings, or if the matching score is low, the entry is channelled to manual annotation; entries without warnings are directly integrated into UniProtKB/Swiss-Prot. UniProtKB entries for which there is available literature are manually annotated.
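The routing decision at the end of the pipeline can be sketched as a small function. This is an illustration under stated assumptions, not the HAMAP code: `route_entry`, the `score_cutoff` parameter and the example scores are hypothetical, since the caption does not say how a "low" matching score is defined.

```python
from typing import List


def route_entry(
    match_score: float,
    warnings: List[str],
    has_literature: bool,
    score_cutoff: float,
) -> str:
    """Return "manual" or "automatic" for one annotated entry.

    score_cutoff is an assumed threshold below which a profile match
    counts as low-scoring.
    """
    if has_literature:
        return "manual"      # entries with available literature are curated by hand
    if warnings:
        return "manual"      # any warning from the template engine needs review
    if match_score < score_cutoff:
        return "manual"      # weak profile match, family membership is uncertain
    return "automatic"       # clean entry, integrated directly


# Invented scores and cutoff, for illustration only.
print(route_entry(42.0, [], False, score_cutoff=20.0))
print(route_entry(42.0, ["ACT_SITE residue missing"], False, score_cutoff=20.0))
print(route_entry(12.5, [], False, score_cutoff=20.0))
```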
