Comparative Study

Predicting gene function using few positive examples and unlabeled ones

Yiming Chen et al. BMC Genomics. 2010 Nov 2;11(Suppl 2):S11. doi: 10.1186/1471-2164-11-S2-S11.

Abstract

Background: The large amount of available functional genomic data now makes it possible to predict gene function computationally, using known functional annotations and the relationships between unknown genes and known ones to map unknown genes to GO functional terms. The prediction procedure is usually formulated as a binary classification problem, and training a binary classifier requires positive and negative example sets of roughly equal size. However, for most functional terms the annotation databases provide only a few positively annotated genes, that is, there are only a few positive examples available for training, which makes direct prediction of gene function infeasible.

Results: We propose a novel approach, SPE_RNE, to train a classifier for each functional term. First, the positive example set is enlarged by creating synthetic positive examples. Second, representative negative examples are selected by iteratively training an SVM (support vector machine) to move the classification hyperplane to an appropriate place. Last, an optimal SVM classifier is trained using a grid search technique. On a combined kernel of yeast protein sequence, microarray expression, protein-protein interaction and GO functional annotation data, we compare SPE_RNE with three typical methods using three classical performance measures: recall R, precision P and their combination F. The method twoclass treats all unlabeled genes as negative examples, twoclassbal randomly selects the same number of negative examples from the unlabeled genes, and PSoL selects a negative example set whose members are far from the positive examples and far from each other.
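The three steps above can be pictured with a short sketch. The abstract does not specify the interpolation scheme or the parameter grid, so the SMOTE-style helper make_synthetic_positives, the RBF kernel (standing in for the combined kernel described above) and the grid values below are illustrative assumptions, not the authors' exact implementation.

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    def make_synthetic_positives(P, n_new, k=5, rng=None):
        """SMOTE-style enlargement of the positive set P (n_pos x d).

        Each synthetic example interpolates between a positive example
        and one of its k nearest positive neighbours (assumed scheme).
        """
        rng = np.random.default_rng(rng)
        k = min(k, len(P) - 1)
        synthetic = []
        for _ in range(n_new):
            i = rng.integers(len(P))
            d = np.linalg.norm(P - P[i], axis=1)          # distances to the other positives
            j = rng.choice(np.argsort(d)[1:k + 1])        # one of the k nearest neighbours
            lam = rng.random()
            synthetic.append(P[i] + lam * (P[j] - P[i]))
        return np.vstack([P, np.array(synthetic)])

    def train_optimal_svm(X, y):
        """Grid search over C and gamma for an RBF-kernel SVM (illustrative grid)."""
        grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1, 1]}
        search = GridSearchCV(SVC(kernel="rbf"), grid, scoring="f1", cv=5)
        search.fit(X, y)
        return search.best_estimator_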

Conclusions: On test data and on the unknown genes, we compute the average and variance of the measure F. The experiments show that our approach has better generalization performance and practical prediction capacity. In addition, our method can also be applied to other organisms such as human.
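For concreteness, the combined measure is taken here to be the usual F = 2PR/(P + R); the abstract does not state the exact weighting, so this is an assumption. A minimal sketch of computing F and summarizing its average and variance over repeated evaluations (the numbers are placeholders, not results from the paper):

    import numpy as np
    from sklearn.metrics import precision_score, recall_score

    def f_measure(y_true, y_pred):
        # F combines precision P and recall R as F = 2PR / (P + R)
        p = precision_score(y_true, y_pred)
        r = recall_score(y_true, y_pred)
        return 0.0 if p + r == 0 else 2 * p * r / (p + r)

    # One F value per test split (or per GO term); values below are placeholders.
    f_scores = np.array([0.62, 0.58, 0.66, 0.60])
    print(f_scores.mean(), f_scores.var())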


Figures

Fig. 1
The average number of correctly predicted genes, according to the GO associations released in December 2008, for genes that were unannotated in April 2007. The height of each bar denotes the average number of genes correctly predicted by the four algorithms and the average true number of genes in the different groups.
Fig. 2
One-class SVM extracts the initial negative example set. Plus signs, plus signs with circles, and circles denote positive examples, potential positive examples and unlabeled examples, respectively. The points covered by the ellipse form the negative example set N0, and the line is the classification hyperplane. A one-class SVM is used to extract the initial negative examples. Given a percentage of negative examples, such as 10 percent, it draws an initial decision boundary that covers most of the positive and unlabeled data. The data points not covered by the decision boundary can be regarded as negative because they are far from the main positive set.
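A minimal sketch of this initialization step, assuming scikit-learn's OneClassSVM; here nu plays the role of the "percentage of negative examples" in the caption, and the function name and parameters are illustrative:

    import numpy as np
    from sklearn.svm import OneClassSVM

    def initial_negatives(P, U, frac_negative=0.10):
        """Extract the initial negative set N0 from the unlabeled data U.

        Fit a one-class SVM on the positive and unlabeled examples so its
        boundary covers roughly (1 - frac_negative) of them; unlabeled
        points falling outside the boundary are treated as negatives.
        """
        X = np.vstack([P, U])
        occ = OneClassSVM(kernel="rbf", nu=frac_negative, gamma="scale")
        occ.fit(X)
        outside = occ.predict(U) == -1      # -1 marks points outside the boundary
        N0 = U[outside]
        U_rest = U[~outside]                # remaining unlabeled pool
        return N0, U_rest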
Fig. 3
The first iteration, in which the negative example set N1 is obtained by moving the classification hyperplane towards the positive example set. Plus signs, plus signs with circles, and circles denote positive examples, potential positive examples and unlabeled examples, respectively. The points covered by the ellipse form the negative example set N1, and the line is the classification hyperplane. With the positive example set and the initial negative example set N0, the SVM classifier C0 is learned; the negative example set N1 consists of the support vectors of C0 and the unlabeled examples predicted as negative.
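A sketch of one such refinement step, assuming scikit-learn and a generic RBF kernel in place of the combined kernel; keeping only the negative-class support vectors is an interpretation of the caption:

    import numpy as np
    from sklearn.svm import SVC

    def refine_negatives(P, N_prev, U):
        """One refinement step in the spirit of SPE_RNE (sketch).

        Train classifier C on P (label 1) vs N_prev (label 0); the next
        negative set combines the negative-class support vectors of C
        with the unlabeled examples that C predicts as negative.
        """
        X = np.vstack([P, N_prev])
        y = np.concatenate([np.ones(len(P)), np.zeros(len(N_prev))])
        clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

        sv_labels = y[clf.support_]
        neg_sv = clf.support_vectors_[sv_labels == 0]   # negative-class support vectors

        pred = clf.predict(U)
        N_next = np.vstack([neg_sv, U[pred == 0]])
        U_next = U[pred == 1]                           # unlabeled examples still unresolved
        return N_next, U_next, clf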
Fig. 4
The negative example set N2 is obtained in the second iteration. Plus signs, plus signs with circles, and circles denote positive examples, potential positive examples and unlabeled examples, respectively. The points covered by the ellipse form the negative example set N2, and the line is the classification hyperplane. With the new training set formed by the positive examples and N1, the SVM classifier C1 is learned; the negative example set N2 consists of the support vectors of C1 and the unlabeled examples predicted as negative.
Fig. 5
Obtaining the negative example set N3 with a further iteration. Plus signs, plus signs with circles, and circles denote positive examples, potential positive examples and unlabeled examples, respectively. The points covered by the ellipse form the negative example set N3, and the line is the classification hyperplane. With the new negative example set N2, the SVM classifier C2 is learned; the negative example set N3 consists of the support vectors of C2 and the unlabeled examples predicted as negative. The process continues until the unlabeled set satisfies |U| ≤ 4 times the size of the positive example set.
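Putting the initialization and refinement steps together, a compact sketch of the whole negative-example selection loop with the stopping condition above (scikit-learn assumed; kernel and parameter values are illustrative):

    import numpy as np
    from sklearn.svm import SVC, OneClassSVM

    def select_negatives(P, U, frac_negative=0.10):
        """Iterate the negative-example refinement until |U| <= 4 * |P| (sketch)."""
        occ = OneClassSVM(kernel="rbf", nu=frac_negative, gamma="scale")
        occ.fit(np.vstack([P, U]))
        outside = occ.predict(U) == -1
        N, U = U[outside], U[~outside]          # N0 and the remaining unlabeled pool

        while len(U) > 4 * len(P):
            X = np.vstack([P, N])
            y = np.concatenate([np.ones(len(P)), np.zeros(len(N))])
            clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
            pred = clf.predict(U)
            neg_sv = clf.support_vectors_[y[clf.support_] == 0]
            N = np.vstack([neg_sv, U[pred == 0]])
            U_next = U[pred == 1]               # hyperplane moves toward the positives
            if len(U_next) == len(U):
                break                           # no unlabeled example removed; stop
            U = U_next
        return N, U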

