PLoS One. 2014 Jun 5;9(6):e95275.

doi: 10.1371/journal.pone.0095275. eCollection 2014.

Identification of divergent protein domains by combining HMM-HMM comparisons and co-occurrence detection

Amel Ghouila¹, Isabelle Florent², Fatma Zahra Guerfali³, Nicolas Terrapon⁴, Dhafer Laouini³, Sadok Ben Yahia⁵, Olivier Gascuel⁶, Laurent Bréhélin⁶

Affiliations

¹ Institut de Biologie Computationnelle, LIRMM, CNRS, Univ. Montpellier 2, Montpellier, France; Computer Science Department, Faculty of Sciences of Tunis, Tunis, Tunisia.
² Centre National de la Recherche Scientifique/Muséum National d'Histoire Naturelle, UMR7245 CNRS-MNHN, Molécules de Communication et Adaptation des Micro-organismes, Adaptation des Protozoaires à leur Environnement, Paris, France.
³ Institut Pasteur de Tunis, LR11IPT02, Laboratory of Transmission, Control and Immunobiology of Infections (LTCII), Tunis-Belvédère, Tunisia; Université Tunis El Manar, Tunis, Tunisia.
⁴ Centre National de la Recherche Scientifique, Aix-Marseille Université, CNRS UMR 7257, AFMB, Marseille, France.
⁵ Computer Science Department, Faculty of Sciences of Tunis, Tunis, Tunisia.
⁶ Institut de Biologie Computationnelle, LIRMM, CNRS, Univ. Montpellier 2, Montpellier, France.

PMID: 24901648
PMCID: PMC4046975
DOI: 10.1371/journal.pone.0095275

Amel Ghouila et al. PLoS One. 2014.

Abstract

Identification of protein domains is a key step for understanding protein function. Hidden Markov Models (HMMs) have proved to be a powerful tool for this task. The Pfam database notably provides a large collection of HMMs which are widely used for the annotation of proteins in sequenced organisms. This is done via sequence/HMM comparisons. However, this approach may lack sensitivity when searching for domains in divergent species. Recently, methods for HMM/HMM comparisons have been proposed and proved to be more sensitive than sequence/HMM approaches in certain cases. However, these approaches are usually not used for protein domain discovery at a genome scale, and the benefit that could be expected from their use for this problem has not been investigated. Using proteins of P. falciparum and L. major as examples, we investigate the extent to which HMM/HMM comparisons can identify new domain occurrences not already identified by sequence/HMM approaches. We show that although HMM/HMM comparisons are much more sensitive than sequence/HMM comparisons, they are not sufficiently accurate to be used as a standalone complement to sequence/HMM approaches at the genome scale. Hence, we propose to use domain co-occurrence (the general tendency of a domain to appear preferentially alongside certain favorite domains in proteins) to improve the accuracy of the approach. We show that the combination of HMM/HMM comparisons and co-occurrence domain detection boosts protein annotations. At an estimated False Discovery Rate of 5%, it revealed 901 and 1098 new domains in Plasmodium and Leishmania proteins, respectively. Manual inspection of a subset of these predictions shows that they include several domain families that were missing in the two organisms. All new domain occurrences have been integrated into the EuPathDomains database, along with the GO annotations that can be deduced.
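The co-occurrence idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy domain names, and the raw-count threshold `min_count` are all assumptions made for illustration, and the count threshold merely stands in for whatever statistical criterion the method actually applies. The sketch keeps a candidate domain hit only when the same protein already carries a known domain that has been seen together with it in a reference annotation set.

```python
from collections import Counter
from itertools import combinations


def co_occurring_pairs(annotated, min_count=2):
    """Return unordered domain pairs seen together in at least min_count proteins.

    annotated maps a protein id to the list of domains annotated on it.
    min_count is a hypothetical stand-in for a proper statistical test.
    """
    counts = Counter()
    for domains in annotated.values():
        for a, b in combinations(sorted(set(domains)), 2):
            counts[(a, b)] += 1
    return {pair for pair, n in counts.items() if n >= min_count}


def certify(candidates, known, pairs):
    """Keep candidate (protein, domain) hits supported by a co-occurring known domain."""
    certified = []
    for protein, domain in candidates:
        for other in known.get(protein, ()):
            if tuple(sorted((domain, other))) in pairs:
                certified.append((protein, domain, other))
                break  # one supporting domain is enough in this sketch
    return certified


# Toy reference annotations (hypothetical domain names).
reference = {
    "r1": ["DomA", "DomB"],
    "r2": ["DomA", "DomB"],
    "r3": ["DomA", "DomC"],
}
pairs = co_occurring_pairs(reference, min_count=2)

# Known domains in the target proteome and low-confidence candidate hits.
known = {"p1": ["DomA"], "p2": ["DomC"]}
candidates = [("p1", "DomB"), ("p2", "DomB"), ("p3", "DomB")]
certified = certify(candidates, known, pairs)
```

In this toy example only the candidate on `p1` is retained, because `DomB` appears with `DomA` twice in the reference set, whereas `p2` carries an unrelated domain and `p3` has no known domain at all.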

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Figure 1. Number of sequenced genomes and domain coverage in the Eukaryote tree.**
This figure reports the number of genomes entirely sequenced in each of the 5 supergroups of the Eukaryote tree. In each group, a few sequenced genomes are provided as examples, along with statistics relative to Pfam domains (release 26): the proportion of proteins where at least one Pfam domain has been identified using recommended Pfam score thresholds (above), and the proportion of amino acids covered by a Pfam domain (below). Most of the genomes sequenced to date belong to the Unikont (241) and plant (60) supergroups. There is a marked difference in protein domain coverage between these groups and the three other groups: while the proportion of proteins with at least one known Pfam domain is usually above 70% in Unikonts and plants, it lies between 50% and 60% in the other groups. Similarly, while the proportion of amino acids covered by a Pfam domain is often above 40% in plants and Unikonts, it is around 22% in the other supergroups.
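The two coverage statistics named in this caption can be computed as in the sketch below. The input layout (a dictionary of protein lengths and a list of `(protein, start, end)` hits with 1-based inclusive coordinates) and the function name are assumptions for illustration; the numbers are toy values, not data from the paper.

```python
def coverage_stats(lengths, hits):
    """Compute protein-level and residue-level domain coverage.

    lengths: {protein id: sequence length in amino acids}
    hits: iterable of (protein id, start, end), 1-based inclusive coordinates
    Returns (fraction of proteins with at least one domain,
             fraction of amino acids covered by a domain).
    """
    covered = {protein: set() for protein in lengths}
    for protein, start, end in hits:
        # A set of positions means overlapping hits are counted once.
        covered[protein].update(range(start, end + 1))
    with_domain = sum(1 for positions in covered.values() if positions)
    protein_fraction = with_domain / len(lengths)
    residue_fraction = sum(len(p) for p in covered.values()) / sum(lengths.values())
    return protein_fraction, residue_fraction


# Toy proteome: three proteins, two of which carry domain hits.
lengths = {"p1": 100, "p2": 200, "p3": 100}
hits = [("p1", 1, 50), ("p1", 41, 60), ("p2", 101, 140)]
protein_fraction, residue_fraction = coverage_stats(lengths, hits)
```

Tracking covered positions in a set keeps the residue-level figure from double-counting overlapping hits, at the cost of memory proportional to the number of covered residues.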

**Figure 2. Sensitivity and accuracy of HHPRED and HMMER for *P. falciparum* and *L. major*.**
Number of new domains (x-axis) identified by HHPRED (green) and HMMER (blue) using local (left) and global (right) alignments for various FDRs (y-axis). For each approach, the two solid lines represent an upper and a lower FDR estimate (see Methods for details). Dashed lines represent the standard error associated with these two estimates. For the sake of clarity, only the standard error above (resp. below) the upper (resp. lower) FDR estimate is shown here.

**Figure 3. Sensitivity and accuracy of HHPRED+CODD and HMMER+CODD using the known Pfam domain occurrences for certifications.**
This figure reports the number of new domains (x-axis) certified by HHPRED+CODD (in orange and green for the phylum specific and non-specific approaches, respectively) and HMMER+CODD (blue) using local (left) and global (right) alignments for various FDR thresholds (y-axis).
