EURASIP J Bioinform Syst Biol. 2009;2009(1):601068.
doi: 10.1155/2009/601068. Epub 2009 Jun 11.

Modelling transcriptional regulation with a mixture of factor analyzers and variational Bayesian expectation maximization

Kuang Lin et al. EURASIP J Bioinform Syst Biol. 2009.

Abstract

Understanding the mechanisms of gene transcriptional regulation through analysis of high-throughput postgenomic data is one of the central problems of computational systems biology. Various approaches have been proposed, but most of them fail to address at least one of the following objectives: (1) allow for the fact that transcription factors are potentially subject to posttranscriptional regulation; (2) allow for the fact that transcription factors cooperate as a functional complex in regulating gene expression; and (3) provide a model and a learning algorithm with manageable computational complexity. The objective of the present study is to propose and test a method that addresses these three issues. The model we employ is a mixture of factor analyzers, in which the latent variables correspond to different transcription factors, grouped into complexes or modules. We pursue inference in a Bayesian framework, using the Variational Bayesian Expectation Maximization (VBEM) algorithm for approximate inference of the posterior distributions of the model parameters, and estimation of a lower bound on the marginal likelihood for model selection. We have evaluated the performance of the proposed method on three criteria: activity profile reconstruction, gene clustering, and network inference.
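
For orientation, the lower bound on the marginal likelihood mentioned above has the following generic variational form. The notation is assumed here for illustration and is not taken from the paper: D denotes the data, Z the latent variables (indicators and factors), θ the model parameters, m the model, and q_Z, q_θ the factorized approximating distributions.

```latex
\ln p(D \mid m) \;\ge\; \mathcal{F}(q_Z, q_\theta)
  = \int q_Z(Z)\, q_\theta(\theta)\,
    \ln \frac{p(D, Z, \theta \mid m)}{q_Z(Z)\, q_\theta(\theta)}
    \, dZ \, d\theta ,
\qquad
\ln p(D \mid m) - \mathcal{F}(q_Z, q_\theta)
  = \mathrm{KL}\!\left( q_Z\, q_\theta \,\middle\|\, p(Z, \theta \mid D, m) \right) \;\ge\; 0 .
```

VBEM alternates between updating q_Z with q_θ held fixed and updating q_θ with q_Z held fixed; neither step can decrease the bound. For discrete latent variables such as component indicators, the integral over Z is read as a sum. Because the gap is a Kullback-Leibler divergence, the bound at convergence can serve as a score for comparing models, for example models with different numbers of mixture components.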

Figures

Figure 1
Transcriptional regulatory network. (a) A transcriptional regulatory network in the form of a bipartite graph, in which a small number of transcription factors (TFs), represented by circles, regulate a large number of genes (represented by squares) by binding to their promoter regions. The black lines in the square boxes indicate gene expression profiles, that is, gene expression values measured under different experimental conditions or for different time points. The black lines in the circles represent TF activity profiles, that is, the concentrations of the TF subpopulation capable of DNA binding. Note that these TF activity profiles are usually unobserved owing to posttranslational modifications, and should hence be included as hidden or latent variables in the statistical model. (b) A more accurate representation of transcriptional regulation that allows for the cooperation of several TFs forming functional complexes; this complex formation is particularly common in higher eukaryotes.
Figure 2
Bayesian mixture of factor analyzers (MFA) model applied to transcriptional regulation. The figure shows a probabilistic independence graph of the Bayesian mixture of factor analyzers (MFA) model proposed in Section 3. Variables are represented by circles, and hyperparameters are shown as square boxes in the graph. formula image components (factor analyzers), each with their own parameters formula image and formula image, are used to model the expression profiles formula image and TF binding profiles formula image of formula image genes. The factor loadings formula image have a zero-mean Gaussian prior distribution, whose precision hyperparameters formula image are given a gamma distribution determined by formula image and formula image. The analyzer displacements formula image and formula image have Gaussian priors determined by the hyperparameters formula image and formula image, respectively. The indicator variables formula image select one out of formula image factor analyzers, and the associated latent variables or factors formula image have normal prior distributions. The indicator variables formula image are given a multinomial distribution, whose parameter vector formula image, the so-called mixture proportions, have a conjugate Dirichlet prior with hyperparameters formula image. formula image and formula image are the diagonal covariance matrices of the Gaussian noise in the expression and binding profiles, respectively. A dashed rectangle denotes a plate, that is an iid repetition over the genes formula image or the mixture components formula image, respectively. The biological interpretation of the model is as follows. formula image represents the composition of the formula imageth transcriptional module, that is, it indicates which TFs bind cooperatively to the promoters of the regulated genes. 
formula image allows for perturbations that result, for example, from the temporary inaccessibility of certain binding sites or a variability of the binding affinities caused by external influences. formula image is the background gene expression profile. formula image represents the activity profile of the formula imageth transcriptional module, which modulates the expression levels of the regulated genes. formula image describes the gene-specific susceptibility to transcriptional regulation, that is, to what extent the expression of the formula imageth gene is influenced by the binding of a transcriptional module to its promoter. A complete description of the model can be found in Section 3.
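
As a rough illustration of the generative process behind a mixture of factor analyzers, the following sketch samples expression profiles from a generic MFA. It is a simplified stand-in, not the model of Section 3: it omits the TF binding profiles and the hyperpriors on loadings and displacements, uses isotropic noise, and all names (`sample_mfa`, `loadings`, `offsets`, `noise_sd`) are hypothetical.

```python
import numpy as np

def sample_mfa(n_genes, n_components, n_factors, n_conditions,
               noise_sd=0.1, seed=0):
    """Draw expression profiles from a generic mixture of factor analyzers.

    Simplified sketch: no binding profiles, no hyperpriors, isotropic noise.
    """
    rng = np.random.default_rng(seed)
    # Mixture proportions drawn from a symmetric Dirichlet prior.
    pi = rng.dirichlet(np.ones(n_components))
    # One loading matrix and one displacement vector per component.
    loadings = rng.normal(size=(n_components, n_conditions, n_factors))
    offsets = rng.normal(size=(n_components, n_conditions))
    # Each gene selects one component (indicator variable).
    s = rng.choice(n_components, size=n_genes, p=pi)
    # Gene-specific latent factors with a standard normal prior.
    x = rng.normal(size=(n_genes, n_factors))
    noise = rng.normal(scale=noise_sd, size=(n_genes, n_conditions))
    # y_i = Lambda_{s_i} x_i + mu_{s_i} + noise_i, evaluated for all genes.
    y = np.einsum("gcf,gf->gc", loadings[s], x) + offsets[s] + noise
    return y, s, x

# Sizes borrowed from the synthetic study: 90 genes, 6 modules, 40 experiments.
y, s, x = sample_mfa(n_genes=90, n_components=6, n_factors=1, n_conditions=40)
```

With a single factor per component, each gene's profile is a scaled copy of its component's loading vector plus a displacement, which loosely mirrors the reading of the factor as a gene-specific susceptibility to a module's activity profile.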
Figure 3
Simulated TF activity and expression profiles. (a) Simulated activity profiles of six hypothetical TF modules. The other panels show simulated expression profiles of the genes regulated by the corresponding TF module (in the same row). From left to right, the three sets have the corresponding observational noise levels of formula image and formula image. The vertical axes show the activity levels (a) or relative log gene expression ratios (other panels), respectively, which are plotted against 40 hypothetical experiments or time points, represented by the horizontal axes.
Figure 4
Simulated TF binding data. The figure shows simulated TF binding data. The vertical axis in each subfigure represents the 90 genes involved in the regulatory network. From left to right: (a) The binary matrix of connectivity between the 6 TF modules (horizontal axis) and the 90 genes, where black entries represent connections. Each module is composed of one or several TFs. (b) The real binding matrix between TFs (horizontal axis) and genes (vertical axis), with black entries indicating binding. (c), (d) The noisy binding data sets used in the synthetic study, with darker entries indicating higher values. Details can be found in Section 4.1.
Figure 5
In- and out-degree distributions of the simulated TF binding data. (a) The arriving connectivity distribution (in-degree distribution). The number of genes regulated by formula image TFs follows an exponential distribution of formula image for in-degree formula image. (b) The departing connectivity distribution (out-degree distribution). The number of TFs per formula image follows the power-law distribution of formula image for out-degree formula image. Note that an exponential distribution is indicated by a linear relationship between formula image and formula image in a log-linear representation (a), whereas a distribution consistent with the power law is indicated by a linear dependence between formula image and formula image in a double logarithmic representation (b).
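
The diagnostic described in this caption can be checked numerically with a small sketch. The rate 0.5 and the exponent 2 below are hypothetical values chosen for illustration; the parameters used in the paper are not recoverable from this page.

```python
import numpy as np

def linear_fit_r2(x, y):
    """Least-squares line through (x, y); returns slope and R^2."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return slope, 1.0 - ss_res / ss_tot

k = np.arange(1, 11, dtype=float)        # degrees 1..10
exp_counts = 100.0 * np.exp(-0.5 * k)    # exponential decay (hypothetical rate)
pow_counts = 100.0 * k ** -2.0           # power law (hypothetical exponent)

# Exponential: log(count) is linear in k (log-linear plot).
slope_exp, r2_exp_loglin = linear_fit_r2(k, np.log(exp_counts))
_, r2_exp_loglog = linear_fit_r2(np.log(k), np.log(exp_counts))

# Power law: log(count) is linear in log(k) (double logarithmic plot).
slope_pow, r2_pow_loglog = linear_fit_r2(np.log(k), np.log(pow_counts))
_, r2_pow_loglin = linear_fit_r2(k, np.log(pow_counts))
```

In each case the fitted slope in the matching representation recovers the decay rate or the exponent with opposite sign, and the fit is worse in the other representation.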
Figure 6
TF-gene interactions reconstructed with MFA and BFA from the synthetic data. The figure shows TF-gene interactions predicted with the proposed MFA-VBEM approach, according to (24), and the BFA-Gibbs method, according to (A.32), using the noisy synthetic gene expression profiles of Figure 3, and the synthetic TF binding data sets shown in Figures 4(c), 4(d). (a), (c) correspond to the noisy TF binding data shown in Figure 4(c). (b), (d) correspond to the less noisy TF binding data, shown in Figure 4(d). (a), (b) show the TF-gene interaction strengths predicted with the MFA-VBEM approach. (c), (d) show the corresponding results obtained with the BFA-Gibbs method. The grey shading indicates the predicted strength of the interactions, with white corresponding to the absence of an interaction, and black corresponding to the presence of an interaction. The horizontal axis in each graph represents the 9 TFs that are involved in the regulation of the 90 genes; the latter are represented by the vertical axis of each graph. In each panel, from top to bottom, the three rows correspond to gene expression profile lengths of 10, 20 and 40. The three columns correspond to the three noise levels of the gene expression profiles. From left to right: formula image and formula image. See Section 4.1 for further details.
Figure 7
ROC curves of TF-gene regulatory network reconstruction for the synthetic data with MFA and BFA. This figure shows various receiver operating characteristic (ROC) curves, where the numbers of predicted true positive interactions (vertical axis) are plotted against the numbers of false positive interactions (horizontal axis). Larger areas under the curve (AUC) indicate a better reconstruction accuracy. (a), (b) show the ROC curves obtained from TF binding data alone, without including gene expression profiles. (a) corresponds to the noisy TF binding data shown in Figure 4(c). (b) corresponds to the less noisy TF binding data, shown in Figure 4(d). (c), (d), each composed of 9 graphs, show the predictions obtained with MFA-VBEM from both noisy TF binding and gene expression profiles. (e), (f), also composed of 9 graphs each, show the results obtained with BFA-Gibbs on the same data. The arrangement of the graphs is the same as in Figure 6. The results suggest that MFA-VBEM systematically outperforms BFA-Gibbs. They also suggest that for noisy TF binding data (c), (e), the inclusion of gene expression profiles and the application of MFA-VBEM lead to an improvement in the TF-gene regulatory network reconstruction.
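
A count-based ROC curve of the kind described in this caption can be computed as in the minimal sketch below. The scores and labels are invented toy values, the function names are hypothetical, and tied scores are not handled specially.

```python
import numpy as np

def roc_counts(scores, labels):
    """Cumulative false/true positive counts as the threshold is lowered."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    sorted_labels = np.asarray(labels, dtype=bool)[order]
    # Prepend 0 so the curve starts at the origin (nothing predicted).
    tp = np.concatenate(([0], np.cumsum(sorted_labels)))
    fp = np.concatenate(([0], np.cumsum(~sorted_labels)))
    return fp, tp

def auc_from_counts(fp, tp):
    """Trapezoidal area under the count curve, normalized to [0, 1]."""
    area = np.sum(np.diff(fp) * (tp[1:] + tp[:-1]) / 2.0)
    return area / (fp[-1] * tp[-1])

# Toy example: predicted interaction strengths and true interaction labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]
fp, tp = roc_counts(scores, labels)
auc = auc_from_counts(fp, tp)   # 8/9 for this toy example
```

The normalized area equals the fraction of (true interaction, non-interaction) pairs in which the true interaction receives the higher score, which is why a larger AUC indicates a better reconstruction.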
Figure 8
TF regulatory network reconstruction for yeast. Receiver operating characteristic (ROC) curves obtained for S. cerevisiae with three different methods: (1) solid line: the proposed MFA-VBEM method, based on the work of Beal [23], and extended as described in Section 3; (2) dashed line: the Bayesian FA model with Gibbs sampling, as proposed in Sabatti and James [16]; and (3) dotted line: maximum likelihood FA with the EM algorithm of Ghahramani and Hinton [24] and a subsequent varimax rotation [39] of the loading matrix towards maximum sparsity, as proposed in Pournara and Wernisch [18]. (a) The performance on a noisy training set, where 10% false positive interactions had been randomly added to the TF binding profiles from the literature [38], while the computation of the ROC curves was based on the unperturbed literature data (network curation task). (b) The out-of-sample performance on an independent test set containing genes not used for training (network prediction). Note that in the latter case the Gibbs sampling approach was run twice, with two different prior matrices formula image: a random prior, where for each gene 11 randomly chosen elements in the matrix were nonzero (dashed line); and a "good" prior, where the nonzero elements in formula image were chosen according to Teixeira et al. [38] subject to the maximum connectivity constraint described in the text (dash-dotted line).
Figure 9
Out-of-sample TF regulatory network reconstruction for yeast. Receiver operating characteristic (ROC) curves obtained for S. cerevisiae with three different methods: (1) solid line: the proposed MFA-VBEM method, based on the work of Beal [23], and extended as described in Section 3; (2) dashed line: the Bayesian FA model with Gibbs sampling, as proposed in Sabatti and James [16]; and (3) dotted line: maximum likelihood FA with the EM algorithm of Ghahramani and Hinton [24] and a subsequent varimax rotation [39] of the loading matrix towards maximum sparsity, as proposed in Pournara and Wernisch [18]. The subfigures show the out-of-sample performance on an independent test set containing genes not used for training (network prediction). From left to right, the models were trained using 40%, 60% and 80% of data.
Figure 10
Composition of one of the TF complexes in yeast. The figure shows the composition of one of the TF modules (formula image) found with MFA-VBEM for the yeast data, plotting formula image on the vertical axis against the 12 TFs involved in the study. As explained in the caption of Figure 2, formula image indicates the composition of the formula imageth TF module. This TF module is clearly dominated by two TFs, Ste12 and Tec1, and thereby corresponds to a well-established module reported in the literature [51].

References

    1. Bussemaker HJ, Li H, Siggia ED. Regulatory element detection using correlation with expression. Nature Genetics. 2001;27(2):167–171. doi: 10.1038/84792.
    2. Conlon EM, Liu XS, Lieb JD, Liu JS. Integrating regulatory motif discovery and genome-wide expression analysis. Proceedings of the National Academy of Sciences of the United States of America. 2003;100(6):3339–3344. doi: 10.1073/pnas.0630591100.
    3. Beer MA, Tavazoie S. Predicting gene expression from sequence. Cell. 2004;117(2):185–198. doi: 10.1016/S0092-8674(04)00304-6.
    4. Phuong TM, Lee D, Lee KH. Regression trees for regulatory element identification. Bioinformatics. 2004;20(5):750–757. doi: 10.1093/bioinformatics/btg480.
    5. Segal E, Yelensky R, Koller D. Genome-wide discovery of transcriptional modules from DNA sequence and gene expression. Bioinformatics. 2003;19(supplement 1):i273–i282. doi: 10.1093/bioinformatics/btg1038.
