Bioinformatics. 2007 Jul 1;23(13):i529-38.
doi: 10.1093/bioinformatics/btm195.

Information theory applied to the sparse gene ontology annotation network to predict novel gene function

Ying Tao et al. Bioinformatics. 2007.

Abstract

Motivation: Despite advances in the gene annotation process, the functions of a large portion of gene products remain insufficiently characterized. In addition, the in silico prediction of novel Gene Ontology (GO) annotations for partially characterized gene functions or processes is highly dependent on reverse genetic or functional genomic approaches. To our knowledge, no prediction method has been demonstrated to be highly accurate for sparsely annotated GO terms (those associated with fewer than 10 genes).

Results: We propose a novel approach, information theory-based semantic similarity (ITSS), to automatically predict molecular functions of genes based on existing GO annotations. Using 10-fold cross-validation, we demonstrate that the ITSS algorithm obtains prediction accuracies (precision 97%, recall 77%) comparable to those of other machine learning algorithms evaluated under similar conditions over densely annotated portions of the GO datasets. This method is able to generate highly accurate predictions in sparsely annotated portions of GO, where previous algorithms have failed. As a result, our technique generates an order of magnitude more functional predictions than previous methods. A 10-fold cross-validation demonstrated a precision of 90% at a recall of 36% for the algorithm over sparsely annotated networks of the recent GO annotations (about 1400 GO terms and 11,000 genes in Homo sapiens). To our knowledge, this article presents the first historical rollback validation for the predicted GO annotations, which may represent more realistic conditions than more widely used cross-validation approaches. By manually assessing a random sample of 100 predictions conducted in a historical rollback evaluation, we estimate that a minimum precision of 51% (95% confidence interval: 43-58%) can be achieved for the human GO Annotation file dated 2003.

Availability: The program is available on request. The 97,732 positive predictions of novel gene annotations from the 2005 GO Annotation dataset and other supplementary information are available at http://phenos.bsd.uchicago.edu/ITSS/.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1. Semantic similarity between concepts
This figure illustrates the semantic similarity between any two concepts, or any two groups of concepts, in an ontology. a) Semantic similarity can be calculated between any two of the nine concepts. b) Semantic similarity can also be calculated between any two arbitrarily defined groups of concepts. Group I contains concepts 1 and 3, Group II comprises concepts 4, 5, 7 and 8, and Group III contains concepts 5, 6, 8 and 9. Concepts can be shared between groups (e.g., concepts 5 and 8 are members of both Group II and Group III).
Fig. 2. Determining semantic similarity between groups of concepts using the pair-wise method
The small circles represent concepts, and the dashed ovals indicate the groups of concepts. The geometric distances between the circles illustrate the semantic distances between concepts; a larger semantic distance indicates a lower semantic similarity between concepts. a) First, for each concept in Group A, the concept in Group B with the maximum semantic similarity (i.e., shortest distance) is determined. The arrows pointing from Group A to Group B indicate these relations. b) Next, for each concept in Group B, the concept in Group A with the maximum semantic similarity (shortest distance) is determined. The arrows pointing from Group B to Group A indicate these relations. c) Finally, the bidirectional arrows illustrate the resulting reciprocal relations that are returned as pairs of concepts with the maximum semantic similarity. The similarity score sim(A,B) is calculated using Equation 2.
Fig. 3. ROC curves of GOAr dataset in 10-fold cross-validation
Fig. 4. Maximal F value vs. the parameter t curves of GOAr dataset in 10-fold cross-validation
Fig. 5. ROC curves in historical rollback validation
Fig. 6. Maximal F value vs. the parameter t curves in historical rollback validation
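
The F value plotted in Figs. 4 and 6 is not defined on this page. As a rough illustration, assuming it is the standard F-measure (the weighted harmonic mean of precision and recall), the cross-validation figures quoted in the abstract can be combined as follows. The function name `f_measure` and the `beta` parameter are illustrative and not taken from the paper.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (beta=1 gives F1).

    ASSUMPTION: the paper's "F value" may use a different weighting.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall pairs quoted in the abstract.
f_dense = f_measure(0.97, 0.77)   # densely annotated portions of GO
f_sparse = f_measure(0.90, 0.36)  # sparsely annotated portions of GO
print(f"dense: {f_dense:.3f}, sparse: {f_sparse:.3f}")
```

Under this assumption, the lower recall in the sparse setting pulls the combined score down even though precision stays high.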
