Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2008;9 Suppl 1(Suppl 1):S2.
doi: 10.1186/gb-2008-9-s1-s2. Epub 2008 Jun 27.

A critical assessment of Mus musculus gene function prediction using integrated genomic evidence

Affiliations

A critical assessment of Mus musculus gene function prediction using integrated genomic evidence

Lourdes Peña-Castillo et al. Genome Biol. 2008.

Abstract

Background: Several years after sequencing the human genome and the mouse genome, much remains to be discovered about the functions of most human and mouse genes. Computational prediction of gene function promises to help focus limited experimental resources on the most likely hypotheses. Several algorithms using diverse genomic data have been applied to this task in model organisms; however, the performance of such approaches in mammals has not yet been evaluated.

Results: In this study, a standardized collection of mouse functional genomic data was assembled; nine bioinformatics teams used this data set to independently train classifiers and generate predictions of function, as defined by Gene Ontology (GO) terms, for 21,603 mouse genes; and the best performing submissions were combined in a single set of predictions. We identified strengths and weaknesses of current functional genomic data sets and compared the performance of function prediction algorithms. This analysis inferred functions for 76% of mouse genes, including 5,000 currently uncharacterized genes. At a recall rate of 20%, a unified set of predictions averaged 41% precision, with 26% of GO terms achieving a precision better than 90%.

Conclusion: We performed a systematic evaluation of diverse, independently developed computational approaches for predicting gene function from heterogeneous data sources in mammals. The results show that currently available data for mammals allows predictions with both breadth and accuracy. Importantly, many highly novel predictions emerge for the 38% of mouse genes that remain uncharacterized.

PubMed Disclaimer

Figures

Figure 1
Figure 1
Measures of performance for the initial round of GO term predictions. (a) Mean area under the receiver operating characteristic curve (AUC) within each evaluation category, evaluated using the held-out genes. Gene Ontology Biological process (GO-BP), Cellular component (GO-CC), and Molecular function (GO-MF) branches are indicated on the x-axis, grouped by specificity (indicated by the minimum number of genes in the training set associated with each GO term in a given category). Upper case letters associated with the color code correspond to submission identifier. (b) Mean AUC within each evaluation category, evaluated prospectively using newly annotated genes. (c) For each pair of submissions X and Y, we test for difference in AUC value for every GO term (evaluated using held-out genes). Color bars indicate fraction of pairwise comparisons for which X's AUC is significantly higher (blue), not significantly different (beige), and significantly lower (maroon). (d) As (c), except evaluated using the newly annotated genes. (e) The fraction of GO terms exceeding the indicated precision at 20% recall (P20R) value, evaluated using held-out genes. The black line corresponds to the fraction of GO terms for which the 'straw man' approach achieved the indicated precision. (f) As (e), except with P20R values derived prospectively from newly annotated genes.
Figure 2
Figure 2
Measures of performance for the second round of GO term predictions. (a, b) As described in Figure 1a, b, except that the gray color area indicates performance in the first set of submissions. (c-f) As described in Figure 1c-f, except that asterisks in (c) and (d) indicate second-round submissions and dashed lines in (e) and (f) indicate the performance of an earlier submission by the same group. GO, Gene Ontology.
Figure 3
Figure 3
Factors affecting prediction performance. (a) Precision at 20% recall (P20R) values evaluated using held-out annotations on all Gene Ontology (GO) terms (vertical axis) within each of the 12 evaluation categories for each submission (left panel) and for a simple guilt-by association using each data set in turn as its sole evidence source (right panel). The number of genes in each evaluation category is shown in parentheses. GO-BP, GO Biological process; GO-CC, GO Cellular component; GO-MF, GO Molecular function; NB, naïve Bayes. Data sets are described in Table 1. (b) Fraction of the 21,603 genes in the data collection with at least one annotated neighbor per data set. (c) Analysis of variance (ANOVA), exploring the effects of various factors on P20R values. (d) Fraction of total variance in P20R values that is explained by each effect. Asterisks in (c, d) indicate interaction between two factors.
Figure 4
Figure 4
Distribution of GO terms at several precision/recall performance points. Proportion of Gene Ontology (GO) terms per evaluation category with a precision/recall performance point that is both above and to the right of a given precision/recall point in the contour plots. GO-BP, GO Biological process; GO-CC, GO Cellular component; GO-MF, GO Molecular function.
Figure 5
Figure 5
Number of high-precision predictions among GO terms for which precision can be confidently estimated. Number of currently annotated (green) versus predicted genes (orange, predictions expected to be correct; gray, predictions expected to be incorrect) for a subset of Gene Ontology (GO) terms for which 30% precision on held-out annotations was achieved while recovering at least 10 positives in the held-out set. The number of predicted genes displayed was limited to 1,000. GO terms were ordered according to similarity of prediction/annotation patterns. Terminal digits of GO term identifiers are shown in parentheses. GO-BP, GO Biological process; GO-CC, GO Cellular component; GO-MF, GO Molecular function.
Figure 6
Figure 6
Illustration of evidence underlying predictions for the GO term 'Cell adhesion'. As an assessment of predictive usefulness, the precision at 20% recall (P20R) value based on each single data source is shown in parentheses. (a) Expression levels of annotated genes (dark green) and predictions (orange), grouped by Pearson correlation and complete-linkage hierarchical clustering. (b) Protein domains in common among predictions and annotated genes. (c) Largest protein-protein interaction network among predictions and annotated genes. OPHID, Online Predicted Human Interaction Database. (d) Disease and (e) phenotype annotations in common between predictions and annotated genes. Terminal digits of identifiers are shown in parentheses. OMIM, Online Mendelian Inheritance in Man.
Figure 7
Figure 7
Illustration of evidence underlying predictions for the GO term 'Mitochondrial part'. (a-e) As described in Figure 6a-e. GO, Gene Ontology.

References

    1. Aerts S, Lambrechts D, Maity S, Van Loo P, Coessens B, De Smet F, Tranchevent LC, De Moor B, Marynen P, Hassan B, Carmeliet P, Moreau Y. Gene prioritization through genomic data fusion. Nat Biotechnol. 2006;24:537–544. - PubMed
    1. Chen Y, Xu D. Global protein function annotation through mining genome-scale data in yeast Saccharomyces cerevisiae. Nucleic Acids Res. 2004;32:6414–6424. - PMC - PubMed
    1. Joshi T, Chen Y, Becker JM, Alexandrov N, Xu D. Genome-scale gene function prediction using multiple sources of high-throughput data in yeast Saccharomyces cerevisiae. OMICS. 2004;8:322–333. - PubMed
    1. Karaoz U, Murali TM, Letovsky S, Zheng Y, Ding C, Cantor CR, Kasif S. Whole-genome annotation by using evidence integration in functional-linkage networks. Proc Natl Acad Sci USA. 2004;101:2888–2893. - PMC - PubMed
    1. Lanckriet GR, De Bie T, Cristianini N, Jordan MI, Noble WS. A statistical framework for genomic data fusion. Bioinformatics. 2004;20:2626–2635. - PubMed

Publication types