Comput Struct Biotechnol J. 2017 Jan 30;15:195-211.
doi: 10.1016/j.csbj.2017.01.009. eCollection 2017.

The effects of shared information on semantic calculations in the gene ontology

Paul W Bible et al. Comput Struct Biotechnol J.

Abstract

The structured vocabulary that describes gene function, the gene ontology (GO), serves as a powerful tool in biological research. One application of GO in computational biology calculates semantic similarity between two concepts to make inferences about the functional similarity of genes. A class of term similarity algorithms explicitly calculates the shared information (SI) between concepts, then substitutes this calculation into traditional term similarity measures such as Resnik, Lin, and Jiang-Conrath. Alternative SI approaches, when combined with ontology choice and term similarity type, lead to many gene-to-gene similarity measures. No thorough investigation has been made into the behavior, complexity, and performance of semantic methods derived from distinct SI approaches. We apply bootstrapping to compare the generalized performance of 57 gene-to-gene semantic measures across six benchmarks. Given the number of measures, we additionally evaluate whether these methods can be leveraged through ensemble machine learning to improve prediction performance. Results showed that the choice of ontology type most strongly influenced performance across all evaluations. Combining measures into an ensemble classifier reduced cross-validation error below that of any individual measure for protein interaction prediction. This improvement resulted from information gained by combining ontology types, as ensemble methods within each GO type offered no improvement. These results demonstrate that multiple SI measures can be leveraged for machine learning tasks such as automated gene function prediction by incorporating methods from across the ontologies. To facilitate future research in this area, we developed the GO Graph Tool Kit (GGTK), an open-source C++ library with a Python interface (github.com/paulbible/ggtk).
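
As a rough illustration of how a shared information value can be substituted into the three term similarity measures named above, the following minimal Python sketch may help. It is not the GGTK API and not the authors' code. The toy ontology, the term probabilities, and all function names are hypothetical. The SI approach shown is the common textbook choice, the information content of the most informative common ancestor; other SI approaches could be passed in through the `si` parameter.

```python
import math

# Hypothetical toy ontology: each term maps to its direct parents.
PARENTS = {
    "root": set(),
    "a": {"root"},
    "b": {"root"},
    "c": {"a"},
    "d": {"a", "b"},
}

# Hypothetical annotation probabilities p(term); the root has probability 1.
PROB = {"root": 1.0, "a": 0.5, "b": 0.4, "c": 0.1, "d": 0.2}


def ancestors(term):
    """Return the term itself together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def ic(term):
    """Information content: negative log of the term probability."""
    return -math.log(PROB[term])


def si_mica(t1, t2):
    """Shared information as the IC of the most informative common ancestor."""
    common = ancestors(t1) & ancestors(t2)
    return max(ic(t) for t in common)


def resnik(t1, t2, si=si_mica):
    """Resnik similarity is the shared information itself."""
    return si(t1, t2)


def lin(t1, t2, si=si_mica):
    """Lin similarity: shared information normalized by the terms' own IC."""
    denom = ic(t1) + ic(t2)
    # Returning 0.0 when both terms carry no information is an assumption
    # made for this sketch; conventions differ.
    return 2.0 * si(t1, t2) / denom if denom > 0 else 0.0


def jiang_conrath_distance(t1, t2, si=si_mica):
    """Jiang-Conrath distance: information not shared by the two terms."""
    return ic(t1) + ic(t2) - 2.0 * si(t1, t2)


if __name__ == "__main__":
    for name, fn in [("Resnik", resnik), ("Lin", lin),
                     ("Jiang-Conrath distance", jiang_conrath_distance)]:
        print(name, fn("c", "d"))
```

Because each measure takes the SI function as an argument, swapping in a different SI definition changes all three measures at once, which is one way to see how a handful of SI approaches multiplies into many term similarity measures.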

Keywords: Function prediction; Gene expression; Gene ontology; Machine learning; Protein–protein interaction; Semantic similarity.

Figures

Fig. 1
Changes in GO graph structure (using is_a and part_of relationships) over time lead to variations in (A) number of nodes, (B) mean ancestor number, and (C) the mean branching factor for each term. Panel D shows the log of the number of terms minus the mean ancestor number.
Fig. 2
Execution times for all-pairs Resnik term similarity show that the GraSM and A-GraSM methods are slower than other shared information methods.
Fig. 3
Performance distributions for the BLAST-based RRBS benchmark for all 57 measures organized by term similarity algorithm and ontology type.
Fig. 4
Performance distributions for the Pfam Jaccard benchmark for all 57 measures organized by term similarity algorithm and ontology type.
Fig. 5
Performance distributions for the Pfam TF–IDF benchmark for all 57 measures organized by term similarity algorithm and ontology type.
Fig. 6
Performance distributions for absolute gene expression correlation (Pearson) against all 57 measures organized by term similarity algorithm and ontology type.
Fig. 7
Performance distributions for Reactome clustering compared to the 57 gene similarity semantic measures organized by term similarity algorithm and ontology type.
Fig. 8
Performance distributions for protein–protein interaction prediction by area under the ROC curve for the 57 gene similarity semantic measures organized by term similarity algorithm and ontology type.
Fig. 9
The relative influence of ontology type, term similarity method, and shared information type (SI) on the mean performance across six evaluations.
Fig. 10
The percent of misclassified samples for each method under study, trained as classifiers, and four voting predictors evaluated by 10-fold cross-validation. The voting predictor for an ontology type is presented as the last classifier within that ontology (red), and the voting predictor utilizing all semantic methods is presented at the far right.
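
The voting predictors in Fig. 10 can be pictured with a small sketch. The Python below is an illustration under stated assumptions, not the authors' pipeline: it assumes each semantic measure is turned into a classifier by a fixed score threshold and that the voting predictor takes a simple majority. The measure names, scores, labels, and the 0.5 threshold are all hypothetical toy values, and cross-validation is omitted for brevity.

```python
def classify(scores, threshold=0.5):
    """Predict an interaction when the similarity score reaches the threshold."""
    return [score >= threshold for score in scores]


def majority_vote(votes):
    """Return True when strictly more than half of the votes are True."""
    return 2 * sum(votes) > len(votes)


def misclassification_percent(predicted, actual):
    """Percent of samples whose predicted label differs from the true label."""
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return 100.0 * wrong / len(actual)


# Hypothetical similarity scores for four gene pairs from three measures,
# one per ontology type. These are toy values, not results from the paper.
scores = {
    "bp_measure": [0.9, 0.6, 0.7, 0.4],
    "mf_measure": [0.8, 0.2, 0.3, 0.1],
    "cc_measure": [0.4, 0.3, 0.8, 0.7],
}
labels = [True, False, True, False]  # hypothetical true interaction labels

# One list of predictions per measure.
per_measure = {name: classify(vals) for name, vals in scores.items()}

# For each gene pair, collect the votes of all measures and take the majority.
voted = [majority_vote(votes) for votes in zip(*per_measure.values())]

if __name__ == "__main__":
    for name, preds in per_measure.items():
        print(name, misclassification_percent(preds, labels))
    print("vote", misclassification_percent(voted, labels))
```

In this toy data each measure errs on different gene pairs, so the vote corrects them. That is only meant to show the mechanism by which combining measures from different ontology types could reduce error, not to reproduce the reported results.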
