BMC Bioinformatics. 2010 Jan 15;11:32.
doi: 10.1186/1471-2105-11-32.

The identification of informative genes from multiple datasets with increasing complexity


S Yahya Anvar et al. BMC Bioinformatics. 2010.

Abstract

Background: In microarray data analysis, factors such as data quality, biological variation, and the increasingly multi-layered nature of more complex biological systems complicate the modelling of regulatory networks that can represent and capture the interactions among genes. We believe that the use of multiple datasets derived from related biological systems leads to more robust models. Therefore, we developed a novel framework for modelling regulatory networks that involves training and evaluation on independent datasets. Our approach includes the following steps: (1) ordering the datasets based on their level of noise and informativeness; (2) selecting a Bayesian classifier with an appropriate level of complexity by evaluating predictive performance on independent datasets; (3) comparing the different gene selections and the influence of increasing the model complexity; (4) performing functional analysis of the informative genes.

Results: In this paper, we identify the most appropriate model complexity using cross-validation and independent test set validation for predicting gene expression in three published datasets related to myogenesis and muscle differentiation. Furthermore, we demonstrate that models trained on simpler datasets can be used to identify interactions among genes and to select the most informative ones. We also show that these models can explain the myogenesis-related genes (genes of interest) significantly better than others (P < 0.004), since the improvement in their rankings is much more pronounced. Finally, after further evaluating our results on synthetic datasets, we show that our approach outperforms a concordance method by Lai et al. in identifying informative genes from multiple datasets with increasing complexity, whilst additionally modelling the interactions between genes.

Conclusions: We show that Bayesian networks derived from simpler controlled systems perform better than those trained on datasets from more complex biological systems. Further, we show that genes from the pool of differentially expressed genes that are highly predictive and consistent across independent datasets are more likely to be fundamentally involved in the biological process under study. We conclude that networks trained on simpler controlled systems, such as in vitro experiments, can be used to model and capture interactions among genes in more complex datasets, such as in vivo experiments, where these interactions would otherwise be concealed by a multitude of other ongoing events.
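
The train-on-one-dataset, test-on-an-independent-dataset evaluation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a plain naive Bayes classifier stands in for the Bayesian network classifiers of the paper, expression values are assumed to be already discretised into states, and the function names and the toy `in_vitro` / `in_vivo` data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, target):
    """Count class and per-feature state frequencies for predicting column `target`."""
    features = [j for j in range(len(rows[0])) if j != target]
    class_counts = Counter(row[target] for row in rows)
    # cond_counts[j][c][v]: samples with target state c whose gene j is in state v
    cond_counts = {j: defaultdict(Counter) for j in features}
    for row in rows:
        for j in features:
            cond_counts[j][row[target]][row[j]] += 1
    n_states = {j: len({row[j] for row in rows}) for j in features}
    return target, features, class_counts, cond_counts, n_states


def predict(model, row, alpha=1.0):
    """Return the most probable target state, with Laplace smoothing `alpha`."""
    target, features, class_counts, cond_counts, n_states = model
    total = sum(class_counts.values())
    best_state, best_score = None, -math.inf
    for state in sorted(class_counts):
        n_c = class_counts[state]
        score = math.log(n_c / total)
        for j in features:
            seen = cond_counts[j][state][row[j]]
            score += math.log((seen + alpha) / (n_c + alpha * n_states[j]))
        if score > best_score:
            best_state, best_score = state, score
    return best_state


def error_rate(model, rows):
    """Fraction of samples whose target gene state is predicted incorrectly."""
    target = model[0]
    wrong = sum(predict(model, row) != row[target] for row in rows)
    return wrong / len(rows)


def per_gene_error_rates(train_rows, test_rows):
    """Train one classifier per gene on `train_rows`; score it on `test_rows`."""
    n_genes = len(train_rows[0])
    return [
        error_rate(train_naive_bayes(train_rows, gene), test_rows)
        for gene in range(n_genes)
    ]


# Toy data (hypothetical): genes 0 and 1 move together, gene 2 is unrelated.
in_vitro = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]  # simpler training set
in_vivo = [(0, 0, 1), (1, 1, 0), (0, 0, 0), (1, 1, 1)]   # independent test set
errors = per_gene_error_rates(in_vitro, in_vivo)
```

In this toy setting the two co-varying genes are predicted without error on the independent set, while the unrelated gene is not, which is the sense in which a low per-gene error rate on independent data marks a gene as informative.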

Figures

Figure 1
Comparison of classifiers with increasing model complexity. Three Bayesian network models (SNB, 1PB, and NPB) were trained using cross-validation and validated on independent datasets. The average error rate of each classifier's predictions was calculated for each gene; the figure shows the overall SSE on the cross-validation set and on the independent test set.
Figure 2
Evaluating the accuracy of 1PB using different datasets for gene selection. We selected genes using only one dataset at a time (black) and compared the average error rate of the 1PB classifier, learnt and trained on that same dataset, when validated independently on the other two datasets (grey).
Figure 3
Investigating the influence of adding more complexity to the model. We investigated how adding 50 randomly selected, uninformative genes to the model affects the performance of the 1PB classifier. The figure compares the average error rate of the 1PB classifier after these 50 uninformative genes were added.
Figure 4
Improvement or deterioration of gene rankings in Sartorelli. First, we selected 100 informative and 50 uninformative genes using the Tomczak dataset and extracted their ranks in Sartorelli. Second, we trained the 1PB classifier on Tomczak and tested it on Sartorelli. Finally, we ranked genes according to the average error rate of the 1PB classifier in predicting their values in Sartorelli. The figure shows the average improvement or deterioration in Sartorelli of the myogenesis-related, top 100, and 50 randomly selected genes, for rankings generated with our method and with the concordance model.
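
The rank comparison summarised in Figure 4 can be sketched as follows. This is an illustrative reading of the caption and not the authors' code: genes are re-ranked by classifier error rate (lowest error first), and the change relative to a baseline ranking is averaged per gene group. The gene labels, baseline ranks, and error rates below are hypothetical toy values.

```python
def rank_by_score(scores):
    """Map each gene to its rank; rank 1 is the gene with the smallest score."""
    ordered = sorted(scores, key=lambda gene: (scores[gene], gene))
    return {gene: position + 1 for position, gene in enumerate(ordered)}


def mean_rank_change(baseline_rank, new_rank, genes):
    """Average of (baseline rank - new rank); positive values mean improvement."""
    return sum(baseline_rank[g] - new_rank[g] for g in genes) / len(genes)


# Toy values (hypothetical): baseline ranks in the test dataset and
# per-gene average error rates of a classifier trained elsewhere.
baseline_rank = {"g1": 3, "g2": 4, "g3": 1, "g4": 2}
error_rates = {"g1": 0.1, "g2": 0.2, "g3": 0.4, "g4": 0.5}

new_rank = rank_by_score(error_rates)
genes_of_interest = ["g1", "g2"]
other_genes = ["g3", "g4"]

improvement_interest = mean_rank_change(baseline_rank, new_rank, genes_of_interest)
improvement_other = mean_rank_change(baseline_rank, new_rank, other_genes)
```

With these toy values the genes of interest move up two places on average and the other genes move down two places, mirroring the kind of contrast the figure reports between gene groups.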
