PLoS One. 2014 Jun 5;9(6):e95275.
doi: 10.1371/journal.pone.0095275. eCollection 2014.

Identification of divergent protein domains by combining HMM-HMM comparisons and co-occurrence detection

Amel Ghouila et al. PLoS One.

Abstract

Identification of protein domains is a key step for understanding protein function. Hidden Markov Models (HMMs) have proved to be a powerful tool for this task. The Pfam database notably provides a large collection of HMMs which are widely used for the annotation of proteins in sequenced organisms. This is done via sequence/HMM comparisons. However, this approach may lack sensitivity when searching for domains in divergent species. Recently, methods for HMM/HMM comparisons have been proposed and proved to be more sensitive than sequence/HMM approaches in certain cases. However, these approaches are usually not used for protein domain discovery at a genome scale, and the benefit that could be expected from their use for this problem has not been investigated. Using proteins of P. falciparum and L. major as examples, we investigate the extent to which HMM/HMM comparisons can identify new domain occurrences not already identified by sequence/HMM approaches. We show that although HMM/HMM comparisons are much more sensitive than sequence/HMM comparisons, they are not sufficiently accurate to be used as a standalone complement to sequence/HMM approaches at the genome scale. Hence, we propose to use domain co-occurrence (the general tendency of a domain to appear preferentially alongside certain favorite domains in proteins) to improve the accuracy of the approach. We show that the combination of HMM/HMM comparisons and co-occurrence domain detection boosts protein annotations. At an estimated False Discovery Rate of 5%, this combination revealed 901 and 1098 new domains in Plasmodium and Leishmania proteins, respectively. Manual inspection of part of these predictions shows that they include several domain families that were missing in the two organisms. All new domain occurrences have been integrated into the EuPathDomains database, along with the GO annotations that can be deduced.
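
The co-occurrence idea described in the abstract can be illustrated with a small sketch. This is not the authors' CODD procedure, whose details are not given here; it is a simplified filter under the assumption that a candidate domain is kept only when it forms a known co-occurring pair with a domain already annotated in the same protein. The function name `certify_candidates`, its parameters, and the domain labels in the example are all hypothetical.

```python
from typing import Dict, FrozenSet, List, Set


def certify_candidates(
    known: Dict[str, Set[str]],
    candidates: Dict[str, Set[str]],
    cooccurring_pairs: Set[FrozenSet[str]],
) -> Dict[str, List[str]]:
    """Keep candidate domains supported by a co-occurring known domain.

    known: protein id -> domains already annotated (e.g. by sequence/HMM search)
    candidates: protein id -> low-confidence candidate domains
    cooccurring_pairs: unordered domain pairs assumed to co-occur significantly
    """
    certified: Dict[str, List[str]] = {}
    for protein, cands in candidates.items():
        context = known.get(protein, set())
        kept = sorted(
            c
            for c in cands
            # A candidate is supported if it pairs with any *other* known domain.
            if any(frozenset((c, k)) in cooccurring_pairs for k in context if k != c)
        )
        if kept:
            certified[protein] = kept
    return certified


# Hypothetical toy data: domain "X" is assumed to co-occur with domain "A".
example_known = {"p1": {"A"}, "p2": {"B"}}
example_candidates = {"p1": {"X", "Y"}, "p2": {"X"}}
example_pairs = {frozenset(("A", "X"))}
example_result = certify_candidates(example_known, example_candidates, example_pairs)
```

In this toy example only candidate "X" in protein "p1" is retained, because "p1" already carries its assumed partner "A". A real pipeline would also need a statistical test to decide which pairs count as co-occurring and an FDR estimate for the certified set.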

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Number of sequenced genomes and domain coverage in the Eukaryote tree.
This figure reports the number of fully sequenced genomes in each of the 5 supergroups of the Eukaryote tree. In each group, a few sequenced genomes are given as examples, along with statistics on Pfam domains (release 26): the proportion of proteins where at least one Pfam domain has been identified using recommended Pfam score thresholds (above), and the proportion of amino acids covered by a Pfam domain (below). Most of the genomes sequenced to date belong to the Unikont (241) and plant (60) supergroups. There is a marked difference in protein domain coverage between these groups and the three other groups: while the proportion of proteins where at least one known Pfam domain has been identified is usually above 70% in Unikonts and plants, it lies between 50% and 60% in the other groups. Similarly, while the proportion of amino acids covered by a Pfam domain is often above 40% in plants and Unikonts, it is around 22% in the other supergroups.
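
The two coverage statistics in this caption can be computed as in the following minimal sketch. It assumes domain hits are given as 1-based inclusive (start, end) positions per protein and that overlapping hits should be counted once; the function names and the input layout are illustrative assumptions, not the format used by Pfam or by the authors.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end); assumed layout


def covered_residues(intervals: List[Interval]) -> int:
    """Count residues covered by at least one interval, counting overlaps once."""
    total = 0
    current_end = 0  # last residue already counted
    for start, end in sorted(intervals):
        start = max(start, current_end + 1)  # skip the already-counted part
        if end >= start:
            total += end - start + 1
            current_end = end
    return total


def domain_coverage(
    lengths: Dict[str, int], domains: Dict[str, List[Interval]]
) -> Tuple[float, float]:
    """Return (fraction of proteins with >= 1 domain, fraction of residues covered)."""
    with_domain = sum(1 for p in lengths if domains.get(p))
    covered = sum(covered_residues(domains.get(p, [])) for p in lengths)
    return with_domain / len(lengths), covered / sum(lengths.values())


# Hypothetical toy proteome: three proteins, two of them with domain hits.
toy_lengths = {"a": 100, "b": 50, "c": 50}
toy_domains = {"a": [(1, 10), (5, 20)], "b": [(1, 10)]}
protein_fraction, residue_fraction = domain_coverage(toy_lengths, toy_domains)
```

Here two of three proteins carry a domain, and 30 of 200 residues are covered, since the overlapping hits on protein "a" span residues 1 to 20 only once.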
Figure 2. Sensitivity and accuracy of HHPRED and HMMER for P. falciparum and L. major.
Number of new domains (x-axis) identified by HHPRED (green) and HMMER (blue) using local (left) and global (right) alignments for various FDRs (y-axis). For each approach, the two solid lines represent an upper and a lower FDR estimate (see Methods for details). Dashed lines represent the standard error associated with these two estimates. For the sake of clarity, only the standard error above (resp. below) the upper (resp. lower) FDR estimate is shown.
Figure 3. Sensitivity and accuracy of HHPRED+CODD and HMMER+CODD using the known Pfam domain occurrences for certifications.
This figure reports the number of new domains (x-axis) certified by HHPRED+CODD (in orange and green for the phylum specific and non-specific approaches, respectively) and HMMER+CODD (blue) using local (left) and global (right) alignments for various FDR thresholds (y-axis).

