Comparative Study

Predicting gene function using few positive examples and unlabeled ones

Yiming Chen et al. BMC Genomics. 2010 Nov 2;11(Suppl 2):S11. doi: 10.1186/1471-2164-11-S2-S11.

Abstract

Background: The large amount of available functional genomic data now makes it possible to predict gene function computationally, using known functional annotations and the relationships between unknown genes and known ones to map unknown genes to GO functional terms. The prediction procedure is usually formulated as a binary classification problem, and training a binary classifier requires positive and negative example sets of roughly equal size. However, for most functional terms the annotation databases provide only a few positively annotated genes, that is, there are only a few positive examples available for training, which makes direct prediction of gene function infeasible.

Results: We propose a novel approach, SPE_RNE, to train a classifier for each functional term. First, the positive example set is enlarged by creating synthetic positive examples. Second, representative negative examples are selected by iteratively training an SVM (support vector machine) to move the classification hyperplane to an appropriate place. Last, an optimal SVM classifier is trained using a grid search technique. On a combined kernel of yeast protein sequence, microarray expression, protein-protein interaction and GO functional annotation data, we compare SPE_RNE with three typical methods using three classical performance measures: recall R, precision P and their combination F. The method twoclass treats all unlabeled genes as negative examples, twoclassbal randomly selects the same number of negative examples from the unlabeled genes, and PSoL selects a negative example set whose members are far from the positive examples and far from each other.
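The three steps above can be pictured with a short sketch. The abstract does not specify the interpolation scheme or the parameter grid, so the SMOTE-style helper make_synthetic_positives, the RBF kernel (standing in for the combined kernel described above) and the grid values below are illustrative assumptions, not the authors' exact implementation.

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    def make_synthetic_positives(P, n_new, k=5, rng=None):
        """SMOTE-style enlargement of the positive set P (n_pos x d).

        Each synthetic example interpolates between a positive example
        and one of its k nearest positive neighbours (assumed scheme).
        """
        rng = np.random.default_rng(rng)
        k = min(k, len(P) - 1)
        synthetic = []
        for _ in range(n_new):
            i = rng.integers(len(P))
            d = np.linalg.norm(P - P[i], axis=1)          # distances to the other positives
            j = rng.choice(np.argsort(d)[1:k + 1])        # one of the k nearest neighbours
            lam = rng.random()
            synthetic.append(P[i] + lam * (P[j] - P[i]))
        return np.vstack([P, np.array(synthetic)])

    def train_optimal_svm(X, y):
        """Grid search over C and gamma for an RBF-kernel SVM (illustrative grid)."""
        grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1, 1]}
        search = GridSearchCV(SVC(kernel="rbf"), grid, scoring="f1", cv=5)
        search.fit(X, y)
        return search.best_estimator_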

Conclusions: On test data and on the unknown genes, we compute the average and variance of the measure F. The experiments show that our approach has better generalization performance and practical prediction capacity. In addition, our method can also be applied to other organisms such as human.
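For concreteness, the combined measure is taken here to be the usual F = 2PR/(P + R); the abstract does not state the exact weighting, so this is an assumption. A minimal sketch of computing F and summarizing its average and variance over repeated evaluations (the numbers are placeholders, not results from the paper):

    import numpy as np
    from sklearn.metrics import precision_score, recall_score

    def f_measure(y_true, y_pred):
        # F combines precision P and recall R as F = 2PR / (P + R)
        p = precision_score(y_true, y_pred)
        r = recall_score(y_true, y_pred)
        return 0.0 if p + r == 0 else 2 * p * r / (p + r)

    # One F value per test split (or per GO term); values below are placeholders.
    f_scores = np.array([0.62, 0.58, 0.66, 0.60])
    print(f_scores.mean(), f_scores.var())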


Figures

Fig. 1
The average number of correctly predicted genes, according to the GO associations released in December 2008, for genes that were unannotated in April 2007. The height of each bar denotes the average number of genes correctly predicted by the four algorithms and the average true number of genes in the different groups.
Fig. 2
One-class SVM extracts the initial negative example set. Plus signs, plus signs with circles, and circles denote positive examples, potential positive examples and unlabeled examples, respectively. The points covered by the ellipse form the negative example set N0, and the line is the classification hyperplane. A one-class SVM is used to extract the initial negative examples. Given a percentage of negative examples, such as 10 percent, it draws an initial decision boundary that covers most of the positive and unlabeled data. The data points not covered by the decision boundary can be regarded as negative because they are far from the main positive set.
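A minimal sketch of this initialization step, assuming scikit-learn's OneClassSVM; here nu plays the role of the "percentage of negative examples" in the caption, and the function name and parameters are illustrative:

    import numpy as np
    from sklearn.svm import OneClassSVM

    def initial_negatives(P, U, frac_negative=0.10):
        """Extract the initial negative set N0 from the unlabeled data U.

        Fit a one-class SVM on the positive and unlabeled examples so its
        boundary covers roughly (1 - frac_negative) of them; unlabeled
        points falling outside the boundary are treated as negatives.
        """
        X = np.vstack([P, U])
        occ = OneClassSVM(kernel="rbf", nu=frac_negative, gamma="scale")
        occ.fit(X)
        outside = occ.predict(U) == -1      # -1 marks points outside the boundary
        N0 = U[outside]
        U_rest = U[~outside]                # remaining unlabeled pool
        return N0, U_rest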
Fig. 3
The first iteration, in which the negative example set N1 is obtained by moving the classification hyperplane towards the positive example set. Plus signs, plus signs with circles, and circles denote positive examples, potential positive examples and unlabeled examples, respectively. The points covered by the ellipse form the negative example set N1, and the line is the classification hyperplane. With the positive example set and the initial negative example set N0, the SVM classifier C0 is learned; the negative example set N1 consists of the support vectors of C0 and the unlabeled examples predicted as negative.
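A sketch of one such refinement step, assuming scikit-learn and a generic RBF kernel in place of the combined kernel; keeping only the negative-class support vectors is an interpretation of the caption:

    import numpy as np
    from sklearn.svm import SVC

    def refine_negatives(P, N_prev, U):
        """One refinement step in the spirit of SPE_RNE (sketch).

        Train classifier C on P (label 1) vs N_prev (label 0); the next
        negative set combines the negative-class support vectors of C
        with the unlabeled examples that C predicts as negative.
        """
        X = np.vstack([P, N_prev])
        y = np.concatenate([np.ones(len(P)), np.zeros(len(N_prev))])
        clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

        sv_labels = y[clf.support_]
        neg_sv = clf.support_vectors_[sv_labels == 0]   # negative-class support vectors

        pred = clf.predict(U)
        N_next = np.vstack([neg_sv, U[pred == 0]])
        U_next = U[pred == 1]                           # unlabeled examples still unresolved
        return N_next, U_next, clf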
Fig. 4
The negative example set N2 is obtained in the second iteration. Plus signs, plus signs with circles, and circles denote positive examples, potential positive examples and unlabeled examples, respectively. The points covered by the ellipse form the negative example set N2, and the line is the classification hyperplane. With the new training set formed by the positive examples and N1, the SVM classifier C1 is learned; the negative example set N2 consists of the support vectors of C1 and the unlabeled examples predicted as negative.
Fig. 5
Obtaining the negative example set N3 with a further iteration. Plus signs, plus signs with circles, and circles denote positive examples, potential positive examples and unlabeled examples, respectively. The points covered by the ellipse form the negative example set N3, and the line is the classification hyperplane. With the new negative example set N2, the SVM classifier C2 is learned; the negative example set N3 consists of the support vectors of C2 and the unlabeled examples predicted as negative. The process continues until the unlabeled set satisfies |U| ≤ 4 times the size of the positive example set.
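Putting the initialization and refinement steps together, a compact sketch of the whole negative-example selection loop with the stopping condition above (scikit-learn assumed; kernel and parameter values are illustrative):

    import numpy as np
    from sklearn.svm import SVC, OneClassSVM

    def select_negatives(P, U, frac_negative=0.10):
        """Iterate the negative-example refinement until |U| <= 4 * |P| (sketch)."""
        occ = OneClassSVM(kernel="rbf", nu=frac_negative, gamma="scale")
        occ.fit(np.vstack([P, U]))
        outside = occ.predict(U) == -1
        N, U = U[outside], U[~outside]          # N0 and the remaining unlabeled pool

        while len(U) > 4 * len(P):
            X = np.vstack([P, N])
            y = np.concatenate([np.ones(len(P)), np.zeros(len(N))])
            clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
            pred = clf.predict(U)
            neg_sv = clf.support_vectors_[y[clf.support_] == 0]
            N = np.vstack([neg_sv, U[pred == 0]])
            U_next = U[pred == 1]               # hyperplane moves toward the positives
            if len(U_next) == len(U):
                break                           # no unlabeled example removed; stop
            U = U_next
        return N, U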

