PLoS One. 2011 Mar 10;6(3):e17568.
doi: 10.1371/journal.pone.0017568.

HMMerThread: detecting remote, functional conserved domains in entire genomes by combining relaxed sequence-database searches with fold recognition

Charles Richard Bradshaw et al. PLoS One. 2011.

Abstract

Conserved domains in proteins are one of the major sources of functional information for experimental design and genome-level annotation. Though search tools for conserved domain databases such as Hidden Markov Models (HMMs) are sensitive in detecting conserved domains in proteins when they share sufficient sequence similarity, they tend to miss more divergent family members, as they lack a reliable statistical framework for the detection of low sequence similarity. We have developed a greatly improved HMMerThread algorithm that can detect remotely conserved domains in highly divergent sequences. HMMerThread combines relaxed conserved domain searches with fold recognition to eliminate false positive, sequence-based identifications. With an accuracy of 90%, our software is able to automatically predict highly divergent members of conserved domain families with an associated 3-dimensional structure. We give additional confidence to our predictions by validation across species. We have run HMMerThread searches on eight proteomes including human and present a rich resource of remotely conserved domains, which adds significantly to the functional annotation of entire proteomes. We find ∼4500 cross-species validated, remotely conserved domain predictions in the human proteome alone. As an example, we find a DNA-binding domain in the C-terminal part of the A-kinase anchor protein 10 (AKAP10), a PKA adaptor that has been implicated in cardiac arrhythmias and premature cardiac death, which upon stress likely translocates from mitochondria to the nucleus/nucleolus. Based on our prediction, we propose that with this HLH-domain, AKAP10 is involved in the transcriptional control of stress response. Further remotely conserved domains we discuss are examples from areas such as sporulation, chromosome segregation and signalling during immune response. 
The HMMerThread algorithm is able to automatically detect the presence of remotely conserved domains in proteins based on weak sequence similarity. Our predictions open up new avenues for biological and medical studies. Genome-wide HMMerThread domains are available at http://vm1-hmmerthread.age.mpg.de.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Architecture of genome-wide HMMerThread searches.
Each protein in a species' proteome is searched against the Pfam database using HMMER2 with a relaxed E-value threshold of 50. A conserved domain detected with an E-value below 1e-04 is scored as positive. A domain with an E-value above 1e-04 goes through a pre-processing and fold recognition step. If fold recognition gives a positive identification (p<0.001) and the HMMER2 E-value is below 0.1, the conserved domain is scored. If the fold is scored positively but the HMMER2 E-value is above 0.1, a cross-species validation is performed and essential residues are flagged to support a confident assignment of the conserved domain.
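The threshold cascade in this caption can be read as a small decision function. The following is only a minimal sketch of that reading, not the authors' implementation: the names `classify_hit` and `Call` are hypothetical, and two behaviours are assumptions because the caption does not state them: hits above the search threshold of 50 are treated as not reported, and hits without a positive fold identification are treated as rejected. A hit with an E-value of exactly 1e-04 is also assumed to go to fold recognition.

```python
from enum import Enum
from typing import Optional

# Thresholds quoted in the Figure 1 caption.
SEARCH_EVALUE_CUTOFF = 50.0    # relaxed HMMER2 search threshold
CONFIDENT_EVALUE = 1e-4        # below this, the sequence hit is scored directly
FOLD_SUPPORTED_EVALUE = 0.1    # below this, a positive fold is enough to score
FOLD_PVALUE = 1e-3             # fold recognition counts as positive below this


class Call(Enum):
    """Hypothetical labels for the outcomes described in the caption."""
    NOT_REPORTED = "above the search threshold (assumed outcome)"
    SEQUENCE_ONLY = "scored on the HMMER2 E-value alone"
    FOLD_SUPPORTED = "scored after positive fold recognition"
    NEEDS_VALIDATION = "cross-species validation and residue flagging"
    REJECTED = "fold recognition not positive (assumed outcome)"


def classify_hit(evalue: float, fold_pvalue: Optional[float] = None) -> Call:
    """Route one domain hit through the cascade.

    fold_pvalue is None when no fold recognition result is available.
    """
    if evalue > SEARCH_EVALUE_CUTOFF:
        return Call.NOT_REPORTED
    if evalue < CONFIDENT_EVALUE:
        return Call.SEQUENCE_ONLY
    # From here on the hit depends on fold recognition.
    if fold_pvalue is None or fold_pvalue >= FOLD_PVALUE:
        return Call.REJECTED
    if evalue < FOLD_SUPPORTED_EVALUE:
        return Call.FOLD_SUPPORTED
    return Call.NEEDS_VALIDATION


if __name__ == "__main__":
    examples = [(1e-6, None), (0.01, 5e-4), (3.0, 5e-4), (3.0, 0.2)]
    for evalue, p in examples:
        print(evalue, p, classify_hit(evalue, p).name)
```

Keeping the four thresholds as named constants makes it easy to see which cut-off governs each branch and to vary them, for example when exploring how many hits fall into the cross-species validation branch.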
Figure 2. Performance of the OpenProspect software.
Comparison of positive identifications of conserved domains using either HMMER2 alone (grey bars) or HMMerThread (red bars). We tested an E-value range between 1e-20 and 1e-04 for positive identification of conserved domains by HMMerThread; 88% of conserved domains were positively identified.
Figure 3. Multiple sequence alignments of remotely conserved domains in proteins identified in functional screens.
Multiple sequence alignment of the Nab1 family with the SAM domain family (taken from CDD). Residues that are conserved between the two families are highlighted in yellow, those found in only one of them are highlighted in blue and green, respectively. Essential, functional residues retrieved from the CD database are indicated by hash keys. Accession numbers of sequences can be found in Supplemental Table S7.
Figure 4. Multiple sequence alignments of remotely conserved domains in proteins associated with mitosis and meiosis.
(A) Multiple sequence alignment of the Ssp2 family with the RRM_1 domain. (B) Multiple sequence alignment of the Wapl/Rad61 family with the SAP domain family. Residues that are conserved between the two families are highlighted in yellow, those found in only one of them are highlighted in blue and green, respectively. Essential, functional residues retrieved from the CD database are indicated by hash keys, those retrieved from literature (SAP domain) with stars. Accession numbers of sequences can be found in Supplemental Table S7.
Figure 5. Multiple sequence alignments of remotely conserved domains found in proteins associated with human diseases.
(A) Multiple sequence alignment of the AKAP10 family with the CUT-like HLH domain family. (B) Multiple sequence alignment of the Lba protein family with the VHS domain family. Residues that are conserved between the two families are highlighted in yellow, those found in only one of them are highlighted in blue and green, respectively. Essential, functional residues retrieved from the CD database are indicated by hash keys. Accession numbers of sequences can be found in Supplemental Table S7.
Figure 6. Multiple sequence alignment of the Als2cL family with the RhoGEF domain family.
Residues that are conserved between the two families are highlighted in yellow, those found in only one of them are highlighted in blue and green, respectively. Essential, functional residues retrieved from the CD database are indicated by hash keys. Accession numbers of sequences can be found in Supplemental Table S7.
