BMC Bioinformatics. 2019 Jun 17;20(1):338.

doi: 10.1186/s12859-019-2875-5.

Combining learning and constraints for genome-wide protein annotation

Stefano Teso¹, Luca Masera², Michelangelo Diligenti³, Andrea Passerini⁴

Affiliations

¹ Computer Science Department, KULeuven, Celestijnenlaan 200 A bus 2402, Leuven, 3001, Belgium.
² Department of Information Engineering and Computer Science, University of Trento, Via Sommarive, 5, Povo di Trento, 38123, Italy.
³ Department of Information Engineering and Mathematics, University of Siena, San Niccolò, via Roma, 56, Siena, 53100, Italy.
⁴ Department of Information Engineering and Computer Science, University of Trento, Via Sommarive, 5, Povo di Trento, 38123, Italy. andrea.passerini@unitn.it.

PMID: 31208327
PMCID: PMC6580517
DOI: 10.1186/s12859-019-2875-5

Stefano Teso et al. BMC Bioinformatics. 2019.

Abstract

Background: The advent of high-throughput experimental techniques paved the way for genome-wide computational analysis and predictive annotation studies. When jointly annotating a large set of related entities, such as all proteins of a given genome, many candidate annotations may be inconsistent, or very unlikely, given existing knowledge. A sound predictive framework that accounts for constraints of this type when making predictions could substantially improve the quality of machine-generated annotations at genomic scale.

Results: We present OCELOT, a predictive pipeline that simultaneously addresses functional and interaction annotation of all proteins of a given genome. The system combines sequence-based predictors of protein function and protein-protein interaction (PPI) with a consistency layer that enforces (soft) constraints expressed as fuzzy logic rules. The rules encode the available prior knowledge about the classification task, including taxonomic constraints over each GO hierarchy (e.g. a protein labeled with a GO term should also be labeled with all of its ancestor terms) as well as rules that tie interaction and function predictions together. An extensive experimental evaluation on the Yeast genome shows that integrating prior knowledge via rules substantially improves the quality of the predictions. OCELOT largely outperforms GoFDR, the only high-ranking system at the last CAFA challenge with a readily available implementation, when GoFDR is given access to intra-genome information only (as OCELOT is), and achieves comparable or better results (depending on the hierarchy and performance measure) when GoFDR is allowed to use information from other genomes. Our system also compares favorably to recent methods based on deep learning.

Keywords: Genome annotation; Kernel methods; Protein function prediction; Protein-protein interaction.
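The taxonomic constraint mentioned in the abstract (a protein labeled with a GO term should also carry its ancestor terms) can be illustrated with a small fuzzy-logic sketch. This is not the OCELOT implementation: the choice of the Łukasiewicz implication, the function names, the term identifiers and the scores are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def implication(a: float, b: float) -> float:
    """Lukasiewicz residuum: truth degree of (a => b) for a, b in [0, 1].

    Assumption: the abstract only says "fuzzy logic rules"; this particular
    implication is chosen here for illustration.
    """
    return min(1.0, 1.0 - a + b)


def taxonomy_penalty(scores: Dict[str, float],
                     is_a: List[Tuple[str, str]]) -> float:
    """Mean violation of the rule child(p) => parent(p) over all IsA edges.

    scores maps a GO term to the predicted membership degree for one protein;
    is_a lists (child, parent) pairs. 0.0 means fully consistent.
    """
    if not is_a:
        return 0.0
    violations = [1.0 - implication(scores[c], scores[p]) for c, p in is_a]
    return sum(violations) / len(violations)


# Hypothetical toy hierarchy: f1 IsA f4 and f2 IsA f4.
is_a = [("f1", "f4"), ("f2", "f4")]

# The parent score is at least as high as each child score: no violation.
consistent = {"f1": 0.9, "f2": 0.2, "f4": 0.95}
# f1 is predicted strongly while its ancestor f4 is not: the rule is violated.
inconsistent = {"f1": 0.9, "f2": 0.2, "f4": 0.3}

penalty_ok = taxonomy_penalty(consistent, is_a)
penalty_bad = taxonomy_penalty(inconsistent, is_a)
```

Because the penalty is a degree in [0, 1] and not a hard check, a learner could trade it off against the fit to the training labels, which is one way to read the "soft" constraints described above.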

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Fig. 1**
Depiction of the OCELOT decision making process. Above: predicted protein–protein interaction network; circles are proteins and lines represent physical interactions. Below: GO taxonomy; boxes are terms and arrows are IsA relations. Predicted annotations for proteins p₁ and p₂ are shown in black: p₁ is annotated with terms f₁, f₄, f₅ and p₂ with f₂, f₄. The functional predictions are driven by the similarity between p₁ and p₂, and by consistency with respect to the GO taxonomy (e.g. f₁ entails either f₃ or f₄, f₂ entails f₄, etc.). The interaction predictions are driven by similarity between protein pairs (i.e. (p₁, p₂) against all other pairs) and are mutually constrained by the functional ones. For instance, since p₁ and p₂ do interact, OCELOT aims at predicting at least one shared term at each level of the GO, e.g. f₄ at the middle level. These constraints are not hard, and can be violated if doing so provides a better joint prediction. As an example, p₁ is annotated with f₁ and p₂ with f₂. Please see the text for the details

**Fig. 2**
Overall performance of all prediction methods on the Yeast dataset. Best viewed in color

**Fig. 3**
Breakdown of the performance of all methods at different GO term depths. Because GoFDR_yeast and GoFDR_U90 predicted no labels for level 6 of cellular component, no metric is reported for that depth level. Best viewed in color

**Fig. 4**
Overall performance of DeepGO, OCELOT, GoFDR and the baseline on the Yeast dataset. Best viewed in color

**Fig. 5**
Overall performance of all prediction methods on the Yeast dataset filtered from remote homologies (sequence identity <25%). Best viewed in color
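The rule described in the Fig. 1 caption, that two interacting proteins should share at least one term at each GO level, can also be written as a fuzzy formula. The sketch below is only an illustration under stated assumptions: it uses the Łukasiewicz t-norm for "and", a maximum for "there exists", and made-up scores and function names; none of this is taken from the OCELOT code.

```python
from typing import Dict, List


def t_norm(a: float, b: float) -> float:
    """Lukasiewicz conjunction of two truth degrees (an assumed choice)."""
    return max(0.0, a + b - 1.0)


def shared_term_degree(scores_a: Dict[str, float],
                       scores_b: Dict[str, float],
                       level_terms: List[str]) -> float:
    """Degree to which some term of the level holds for both proteins.

    The existential quantifier is modelled as a max over the level's terms;
    level_terms must be non-empty.
    """
    return max(t_norm(scores_a[f], scores_b[f]) for f in level_terms)


def interaction_rule_degree(interaction: float,
                            scores_a: Dict[str, float],
                            scores_b: Dict[str, float],
                            level_terms: List[str]) -> float:
    """Truth degree of: interact(a, b) => exists f. f(a) and f(b)."""
    shared = shared_term_degree(scores_a, scores_b, level_terms)
    return min(1.0, 1.0 - interaction + shared)


# Hypothetical scores for the middle level of the toy taxonomy in Fig. 1.
p1 = {"f3": 0.1, "f4": 0.9}
p2 = {"f3": 0.2, "f4": 0.8}
middle_level = ["f3", "f4"]

# A confidently predicted interaction: the rule holds only partially because
# the shared-term degree (about 0.7) is below the interaction score (0.9).
likely = interaction_rule_degree(0.9, p1, p2, middle_level)
# An unlikely interaction places no demand on shared terms.
unlikely = interaction_rule_degree(0.1, p1, p2, middle_level)
```

A degree below 1 marks a partial violation and does not make the prediction infeasible, matching the caption's remark that the constraints are not hard.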
