Comparative Study

Learning gene regulatory networks from only positive and unlabeled data

Luigi Cerulo et al. BMC Bioinformatics. 2010 May 5;11:228. doi: 10.1186/1471-2105-11-228.

Abstract

Background: Recently, supervised learning methods have been exploited to reconstruct gene regulatory networks from gene expression data. The reconstruction of a network is modeled as a binary classification problem for each pair of genes: a statistical classifier is trained to recognize the relationships between the activation profiles of gene pairs. This approach has been shown to outperform previous unsupervised methods. However, the supervised approach raises open questions. In particular, although known regulatory connections can safely be assumed to be positive training examples, obtaining negative examples is not straightforward, because definite knowledge that a given pair of genes does not interact is typically unavailable.
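The pair-wise formulation described above can be sketched in a few lines. This is an illustrative sketch only, not the paper's implementation: the gene names, the tiny expression profiles, and the feature encoding (concatenating the two genes' activation profiles) are all assumptions.

```python
# Sketch: casting network reconstruction as binary classification over gene pairs.
# Hypothetical data: each gene maps to its expression profile across 3 experiments.
from itertools import permutations

expr = {
    "geneA": [0.1, 0.9, 0.4],
    "geneB": [0.2, 0.8, 0.5],
    "geneC": [0.7, 0.1, 0.3],
}
# Known regulatory connections serve as positive examples (regulator, target).
known_interactions = {("geneA", "geneB")}

examples = []
for g1, g2 in permutations(expr, 2):        # every ordered gene pair
    features = expr[g1] + expr[g2]          # concatenated activation profiles
    label = 1 if (g1, g2) in known_interactions else 0  # 0 = unlabeled, NOT negative
    examples.append((features, label))

print(len(examples))  # 6 ordered pairs from 3 genes
```

Note that the pairs labeled 0 are merely unlabeled: the absence of a pair from the known-interactions set does not certify that the genes do not interact, which is exactly the problem the paper addresses.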

Results: A recent advance in data mining research is a method capable of learning a classifier from only positive and unlabeled examples, with no need for labeled negative examples. Applied to the reconstruction of gene regulatory networks, we show that this method significantly outperforms the current state-of-the-art machine learning methods. We assess the new method on both simulated and experimental data and obtain a major performance improvement.
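Positive-unlabeled (PU) learning of the kind referenced here rests on a simple correction: a "nontraditional" classifier g trained to separate labeled positives from unlabeled examples estimates c * p(y=1|x), where c is the probability that a true positive is labeled, so dividing by an estimate of c recovers p(y=1|x). The following is a minimal numeric sketch under idealized assumptions — the scores, the labeling frequency, and the held-out positives are all invented for illustration, and no claim is made that this is the paper's exact procedure.

```python
# PU-learning correction sketch: a classifier trained on positive-vs-unlabeled
# data outputs g(x) ~= c * p(y=1 | x), where c is the probability that a true
# positive was labeled. Dividing by an estimate of c recovers p(y=1 | x).
c = 0.5                      # assumed (unknown in practice) labeling frequency
true_p = [0.9, 0.3, 0.6]     # hypothetical true interaction probabilities
g = [c * p for p in true_p]  # what the positive-vs-unlabeled classifier sees

# Estimate c as the mean score g(x) over a held-out set of labeled positives
# (idealized here: held-out positives have p(y=1 | x) = 1, so g(x) = c).
c_hat = sum([c * 1.0, c * 1.0]) / 2

corrected = [gx / c_hat for gx in g]
print([round(p, 1) for p in corrected])  # recovers [0.9, 0.3, 0.6]
```

The practical appeal for network inference is that the known regulatory connections play the role of the labeled positives, while all remaining gene pairs stay unlabeled rather than being forced into the negative class.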

Conclusions: Compared to unsupervised methods for gene network inference, supervised methods are potentially more accurate, but for training they need a complete set of known regulatory connections. A supervised method that can be trained using only positive and unlabeled data, as presented in this paper, is especially beneficial for the task of inferring gene regulatory networks, because only an incomplete set of known regulatory connections is available in public databases such as RegulonDB, TRRD, KEGG, Transfac, and IPA.


Figures

Figure 1
Supervised vs unsupervised approaches in the identification of gene-gene interactions. The figure depicts the two main perspectives followed by supervised and unsupervised methods in inferring gene regulatory networks. Both induce the interaction model from genomic data (e.g. microarray experiments), but supervised methods also need a set of prior known interactions.
Figure 2
Transitive closure heuristic example [20]. The figure shows the PSEUDO-RANDOM negative selection heuristic based on the transitive closure of the known regulatory network.
Figure 3
Distribution of positives in the simulated datasets. The figure shows the distribution of the number of positives in the set of random gene networks generated with the GeneNetWeaver tool.
Figure 4
An example of a 50-gene E. coli sub-network generated with the GeneNetWeaver tool. The figure shows a typical gene-gene interaction network generated with the GeneNetWeaver tool, extracted randomly from the complete E. coli network published in RegulonDB.
Figure 5
Interval plots of F-Measure (95% CI of the mean) of the PosOnly, PSEUDO-RANDOM, and SVMOnly classifiers on simulated Escherichia coli and Saccharomyces cerevisiae data. The figure shows interval plots, with 95% confidence intervals, of the mean F-Measure of PosOnly, PSEUDO-RANDOM, and SVMOnly obtained on simulated data at different percentages of known positives. All algorithms exhibit a progressive increase in performance as the number of known positive examples grows from P = 10% to P = 100%, nearly converging at P = 100%.
Figure 6
Average difference between the PosOnly, PSEUDO-RANDOM, and SVMOnly classifiers on simulated Escherichia coli and Saccharomyces cerevisiae data. The figure shows the difference between the mean F-Measures of PosOnly, PSEUDO-RANDOM, and SVMOnly at different percentages of known positives and with different network sizes. The difference varies with the number of genes and peaks in the range between P = 40% and P = 60%.
Figure 7
Average Precision of the PosOnly, PSEUDO-RANDOM, and SVMOnly classifiers on experimental data. The figure shows interval plots, with 95% confidence intervals, of the mean Precision of PosOnly, PSEUDO-RANDOM, and SVMOnly obtained on experimental data at different percentages of known positives.
Figure 8
Average Recall of the PosOnly, PSEUDO-RANDOM, and SVMOnly classifiers on experimental data. The figure shows interval plots, with 95% confidence intervals, of the mean Recall of PosOnly, PSEUDO-RANDOM, and SVMOnly obtained on experimental data at different percentages of known positives.
Figure 9
Average F-Measure of the PosOnly, PSEUDO-RANDOM, and SVMOnly classifiers on experimental data. The figure shows interval plots, with 95% confidence intervals, of the mean F-Measure of PosOnly, PSEUDO-RANDOM, and SVMOnly obtained on experimental data at different percentages of known positives. The algorithms exhibit a progressive increase in performance as the number of known positive examples grows from P = 10% to P = 100%, nearly converging at P = 100%.
Figure 10
Average F-Measure of PosOnly, PSEUDO-RANDOM, and SVMOnly on simulated data at different network sizes. The figure shows the performance of PosOnly, PSEUDO-RANDOM, and SVMOnly obtained with networks of different sizes. Each approach exhibits similar behavior as the number of genes increases: the average performance of the classifier increases when the percentage of known positives is low, and decreases when the percentage of known positives is high.
Figure 11
Comparison with the unsupervised methods ARACNE and CLR on simulated data. Average F-Measure at different percentages of known positives. The figure shows the difference between supervised and unsupervised methods obtained on simulated data. The performance of supervised methods increases with the percentage of known positive examples. The performance of the unsupervised information-theoretic methods, by contrast, decreases with the number of genes in a regulatory network and is of course independent of the percentage of known positive examples. The intersection between the supervised and unsupervised curves occurs at different percentages of known positives and decreases with the number of genes composing the network.
Figure 12
Comparison with the unsupervised methods ARACNE and CLR on simulated data. Average AUROC at different percentages of known positives. The figure shows the difference in performance between supervised and unsupervised methods in terms of AUROC (Area Under the ROC Curve) obtained on simulated data. The AUROC of PosOnly and SVMOnly is identical, because both methods produce the same ranking of predictions. As with the F-Measure, the AUROC of the supervised methods increases with the percentage of known positive examples. The performance of the unsupervised information-theoretic methods, instead, is almost constant, reflecting the fact that unsupervised methods can select very precise top regulations but are unable to uncover (by means of a threshold) the complete set of gene regulations of a network.
Figure 13
Comparison with the unsupervised methods ARACNE and CLR on experimental data. Average F-Measure at different percentages of known positives. The figure shows the difference between supervised and unsupervised methods obtained on experimental data. The performance of supervised methods increases with the percentage of known positive examples, whereas the performance of unsupervised methods is independent of it.

References

    1. Hecker M, Lambeck S, Toepfer S, van Someren E, Guthke R. Gene regulatory network inference: Data integration in dynamic models - A review. Bio Systems. 2008;96(1):86–103. - PubMed
    2. Zoppoli P, Morganella S, Ceccarelli M. TimeDelay-ARACNE: Reverse engineering of gene networks from time-course data by an information theoretic approach. BMC Bioinformatics. 2010;11:154. doi: 10.1186/1471-2105-11-154. - DOI - PMC - PubMed
    3. Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Dalla Favera R, Califano A. ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics. 2006;7(Suppl 1):S7. doi: 10.1186/1471-2105-7-S1-S7. - DOI - PMC - PubMed
    4. Faith JJ, Hayete B, Thaden JT, Mogno I, Wierzbowski J, Cottarel G, Kasif S, Collins JJ, Gardner TS. Large-Scale Mapping and Validation of Escherichia coli Transcriptional Regulation from a Compendium of Expression Profiles. PLoS Biol. 2007;5:e8. doi: 10.1371/journal.pbio.0050008. - DOI - PMC - PubMed
    5. Liang S, Fuhrman S, Somogyi R. Reveal, a general reverse engineering algorithm for inference of genetic network architectures. Pac Symp Biocomput. 1998. pp. 18–29. - PubMed
