Using context to improve protein domain identification

Alejandro Ochoa¹, Manuel Llinás, Mona Singh

BMC Bioinformatics. 2011 Mar 31:12:90.

Affiliation

¹ Department of Molecular Biology, Princeton University, Princeton, NJ, USA.

PMID: 21453511
PMCID: PMC3090354
DOI: 10.1186/1471-2105-12-90

Abstract

Background: Identifying domains in protein sequences is an important step in protein structural and functional annotation. Existing domain recognition methods typically evaluate each domain prediction independently of the rest. However, the majority of proteins are multidomain, and pairwise domain co-occurrences are highly specific and non-transitive.

Results: Here, we demonstrate how to exploit domain co-occurrence to boost weak domain predictions that appear in previously observed combinations, while penalizing higher confidence domains if such combinations have never been observed. Our framework, Domain Prediction Using Context (dPUC), incorporates pairwise "context" scores between domains, along with traditional domain scores and thresholds, and improves domain prediction across a variety of organisms from bacteria to protozoa and metazoa. Among the genomes we tested, dPUC is most successful at improving predictions for the poorly annotated malaria parasite Plasmodium falciparum, for which over 38% of the genome is currently unannotated. Our approach enables high-confidence annotations in this organism and the identification of orthologs to many core machinery proteins conserved in all eukaryotes, including those involved in ribosomal assembly and other RNA processing events, which surprisingly had not been previously known.

Conclusions: Overall, our results demonstrate that this new context-based approach will provide significant improvements in domain and function prediction, especially for poorly understood genomes for which the need for additional annotations is greatest. Source code for the algorithm is available under a GPL open source license at http://compbio.cs.princeton.edu/dpuc/. Pre-computed results for our test organisms and a web server are also available at that location.

Figures

**Figure 1**
**Illustration of the dPUC framework using Pfam to identify initial domains**. A. We gather candidate domain predictions using Pfam with a permissive threshold. Domains are arranged along the x-axis by their amino acid coordinates, but the y-axis arrangement is arbitrary (there may be overlapping initial predictions). B. We build a network between candidate domains. Node weights are the normalized Pfam HMM scores of the corresponding domains (raw score minus the domain threshold). Edge weights between non-overlapping domains are set to our context scores. C. Standard Pfam will make limited predictions, while dPUC may boost weak domains over the thresholds if they are in the correct context. The dPUC solution maximizes the sum of the node and edge weights, without overlaps, and each node must satisfy the Pfam thresholds. The final normalized domain scores are shown for each framework.
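
The objective described in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the dPUC implementation: it uses exhaustive search over subsets, which is only practical for a handful of candidates, and it leaves out the per-node threshold condition because the caption does not give its exact form. The function names, the toy coordinates, the weights, and the symmetric treatment of context scores are all assumptions made for illustration.

```python
from itertools import combinations


def overlaps(a, b):
    """Return True if two inclusive (start, end) intervals share a residue."""
    return a[0] <= b[1] and b[0] <= a[1]


def best_domain_set(domains, context):
    """Exhaustively find the non-overlapping subset with the highest total weight.

    domains: dict mapping name -> (start, end, node_weight), where node_weight
             is the normalized score (raw score minus the domain threshold).
    context: dict mapping frozenset({name1, name2}) -> edge weight; pairs that
             are absent contribute 0 (an assumption of this sketch).
    """
    names = sorted(domains)
    best_score, best_set = 0.0, ()  # the empty prediction scores 0
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            pairs = list(combinations(subset, 2))
            # Skip any subset containing two overlapping domains.
            if any(overlaps(domains[a][:2], domains[b][:2]) for a, b in pairs):
                continue
            # Objective: sum of node weights plus sum of edge weights.
            score = sum(domains[n][2] for n in subset)
            score += sum(context.get(frozenset(p), 0.0) for p in pairs)
            if score > best_score:
                best_score, best_set = score, subset
    return best_set, best_score


# Hypothetical toy input: A passes its threshold, B and C fall below theirs.
candidates = {
    "A": (1, 100, 5.0),
    "B": (120, 200, -2.0),
    "C": (90, 150, -1.0),
}
context_scores = {frozenset({"A", "B"}): 4.0}

chosen, total = best_domain_set(candidates, context_scores)
print(chosen, total)
```

In this toy input, B on its own scores below its threshold, but its positive context score with A makes the pair A and B the best solution, while C is excluded because it overlaps both.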

**Figure 2**
**dPUC predicts more domains over a range of FDRs**. A. Illustration of the FDR estimation procedure. For each original protein sequence, we make predictions on it and on twenty shuffled sequences concatenated to the original sequence, to allow "real" domains (Y, Z) to boost false predictions on the shuffled sequence (domains V, W, X) when using context. The estimated FDR is the ratio of false predictions per protein to the total number of predictions per protein. In this illustration, FDR ≈ (3/20)/(2) = 7.5%. B. The y-axis is the number of predicted domains per protein ("signal"), while the x-axis is the FDR ("noise"), so better performing methods have higher curves (more signal for a given noise threshold). dPUC (green circles) outperforms all non-context Pfam variations tested and the context method CODD.
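
The FDR arithmetic in the Figure 2 caption can be written out as a short Python sketch. The function and parameter names are hypothetical; the inputs are the numbers given in the caption's illustration (3 false predictions across 20 shuffled sequences, and 2 predictions on the original protein).

```python
def estimated_fdr(false_predictions, n_shuffled, predictions_per_protein):
    """Estimate the FDR as false predictions per protein over total predictions per protein.

    false_predictions: number of predictions that fall on shuffled sequences
    n_shuffled: number of shuffled sequences tested per original protein
    predictions_per_protein: number of predictions per original protein
    """
    false_per_protein = false_predictions / n_shuffled
    return false_per_protein / predictions_per_protein


# Values from the caption's illustration: (3/20)/2
fdr = estimated_fdr(false_predictions=3, n_shuffled=20, predictions_per_protein=2)
print(f"{fdr:.1%}")
```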

**Figure 3**
**dPUC predicts more domains over a range of Ortholog Coherence scores on Plasmodium species**. A. Illustration of scores. Domain predictions are made on hypothetical aligned orthologs and in-paralogs (Pf1, Pf2, Pv1, and Pc1). Color denotes domain family. Domain S overlaps T of the same family, so their scores are 1/3 (since they lack predictions in Pv1 and Pc1). In contrast, U is predicted 100% in its orthologs and in-paralogs. Y overlaps V but is not of the same family, so its score is zero. Similarly, Z does not overlap any domains, so its score is also zero. The score of this method is the average domain score on all proteins, ~0.58, while the average number of domains per protein is 2. B. The y-axis is the number of predicted domains per protein ("signal"), while the x-axis is the ortholog coherence score (inversely related to "noise"), so better performing methods have higher curves (more signal for a given noise threshold). dPUC (green circles) outperforms the other methods. Symbols and colors are as in **Figure 2**.
