2019 Jun 17;20(1):338.
doi: 10.1186/s12859-019-2875-5.

Combining learning and constraints for genome-wide protein annotation

Stefano Teso et al. BMC Bioinformatics.

Abstract

Background: The advent of high-throughput experimental techniques paved the way to genome-wide computational analysis and predictive annotation studies. When considering the joint annotation of a large set of related entities, like all proteins of a certain genome, many candidate annotations could be inconsistent, or very unlikely, given the existing knowledge. A sound predictive framework capable of accounting for this type of constraint when making predictions could substantially contribute to the quality of machine-generated annotations at a genomic scale.

Results: We present OCELOT, a predictive pipeline which simultaneously addresses functional and interaction annotation of all proteins of a given genome. The system combines sequence-based predictors for functional and protein-protein interaction (PPI) prediction with a consistency layer enforcing (soft) constraints as fuzzy logic rules. The enforced rules represent the available prior knowledge about the classification task, including taxonomic constraints over each GO hierarchy (e.g. a protein labeled with a GO term should also be labeled with all ancestor terms) as well as rules combining interaction and function prediction. An extensive experimental evaluation on the Yeast genome shows that the integration of prior knowledge via rules substantially improves the quality of the predictions. The system largely outperforms GoFDR, the only high-ranking system at the last CAFA challenge with a readily available implementation, when GoFDR is given access to intra-genome information only (as OCELOT is), and has comparable or better results (depending on the hierarchy and performance measure) when GoFDR is allowed to use information from other genomes. Our system also compares favorably to recent methods based on deep learning.

Keywords: Genome annotation; Kernel methods; Protein function prediction; Protein-protein interaction.
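To make the taxonomic constraint in the Results concrete, here is a minimal sketch of how a rule such as "a protein labeled with a GO term should also be labeled with all ancestor terms" could be scored as a soft fuzzy-logic rule. The abstract does not say which fuzzy operators OCELOT uses, so the Łukasiewicz implication, the function names, the term labels, and the scores below are all assumptions chosen for illustration and are not taken from the paper.

```python
from typing import Dict, List


def lukasiewicz_implies(a: float, b: float) -> float:
    """Truth degree of (a -> b) under the Lukasiewicz implication: min(1, 1 - a + b)."""
    return min(1.0, 1.0 - a + b)


def hierarchy_penalty(scores: Dict[str, float], parents: Dict[str, List[str]]) -> float:
    """Sum of soft violations of 'child term implies parent term' for one protein.

    scores  : hypothetical predicted membership degrees in [0, 1], one per term
    parents : hypothetical IsA edges, mapping each term to its direct parents
    """
    penalty = 0.0
    for child, parent_terms in parents.items():
        for parent in parent_terms:
            # A fully satisfied rule has truth 1 and contributes no penalty.
            penalty += 1.0 - lukasiewicz_implies(scores[child], scores[parent])
    return penalty


# Invented example: term "a" IsA "b", and "b" IsA "c".
scores = {"a": 0.9, "b": 0.6, "c": 0.8}
parents = {"a": ["b"], "b": ["c"]}

# Only a -> b is violated (0.9 > 0.6); b -> c holds because 0.6 <= 0.8.
penalty = hierarchy_penalty(scores, parents)
print(f"hierarchy penalty: {penalty:.2f}")
```

Because the penalty is a real number rather than a hard pass/fail check, a learner can trade it off against the evidence from the sequence-based predictors, which is the sense in which the constraints are soft.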

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Depiction of the OCELOT decision-making process. Above: predicted protein–protein interaction network; circles are proteins and lines represent physical interactions. Below: GO taxonomy; boxes are terms and arrows are IsA relations. Predicted annotations for proteins p1 and p2 (black): p1 is annotated with terms f1, f4, f5 and p2 with f2, f4. The functional predictions are driven by the similarity between p1 and p2, and by consistency with respect to the GO taxonomy (e.g. f1 entails either f3 or f4, f2 entails f4, etc.). The interaction predictions are driven by similarity between protein pairs (i.e. (p1, p2) against all other pairs) and are mutually constrained by the functional ones. For instance, since p1 and p2 do interact, OCELOT aims at predicting at least one shared term at each level of the GO, e.g. f4 at the middle level. These constraints are not hard, and can be violated if doing so provides a better joint prediction. As an example, p1 is annotated with f1 and p2 with f2. Please see the text for the details.
Fig. 2
Overall performance of all prediction methods on the Yeast dataset. Best viewed in color
Fig. 3
Breakdown of the performance of all methods at different GO term depths. Because GoFDRyeast and GoFDRU90 predicted no labels for level 6 of cellular component, no metric is reported for that depth level. Best viewed in color
Fig. 4
Overall performance of DeepGO, OCELOT, GoFDR and the baseline on the Yeast dataset. Best viewed in color
Fig. 5
Overall performance of all prediction methods on the Yeast dataset filtered from remote homologies (sequence identity <25%). Best viewed in color
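The rule described in the Fig. 1 caption, that interacting proteins should share at least one term at each level of the GO, can be pictured with the sketch below. It is only an illustration: the choice of min for "and", max for "or", and the Łukasiewicz implication are assumptions, as are the function names and all numeric scores. The term names reuse f1, f2, and f4 from the caption, with a simplified two-level layout.

```python
from typing import Dict, List, Sequence


def shared_term_degree(s1: Dict[str, float], s2: Dict[str, float],
                       level_terms: Sequence[str]) -> float:
    """Degree to which two proteins share at least one term on a level.

    'Both proteins have term t' is scored with min, and 'some term is
    shared' with max over the terms of the level.
    """
    return max(min(s1[t], s2[t]) for t in level_terms)


def interaction_rule_truth(interact: float, s1: Dict[str, float],
                           s2: Dict[str, float],
                           levels: Sequence[Sequence[str]]) -> List[float]:
    """Per-level truth of 'interact(p1, p2) -> p1 and p2 share a term'.

    Uses the Lukasiewicz implication min(1, 1 - a + b); a value below 1
    means the soft rule is partly violated on that level.
    """
    return [
        min(1.0, 1.0 - interact + shared_term_degree(s1, s2, terms))
        for terms in levels
    ]


# Invented scores: p1 favors f1, p2 favors f2, and both favor f4.
p1 = {"f1": 0.8, "f2": 0.1, "f4": 0.7}
p2 = {"f1": 0.2, "f2": 0.9, "f4": 0.6}
levels = [["f1", "f2"], ["f4"]]  # lower level, middle level

truths = interaction_rule_truth(0.9, p1, p2, levels)
for terms, truth in zip(levels, truths):
    print(terms, f"{truth:.2f}")
```

With these invented numbers the rule is better satisfied on the level containing f4, which both proteins share, than on the level where p1 has f1 and p2 has f2. This mirrors the caption's example of a constraint that is violated at one level while the joint prediction is still accepted.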
