EURASIP J Bioinform Syst Biol. 2009;2009(1):601068.

doi: 10.1155/2009/601068. Epub 2009 Jun 11.

Modelling transcriptional regulation with a mixture of factor analyzers and variational Bayesian expectation maximization

Kuang Lin¹, Dirk Husmeier

PMID: 19572011
PMCID: PMC3171433
DOI: 10.1155/2009/601068

Kuang Lin et al. EURASIP J Bioinform Syst Biol. 2009.

Affiliation

¹ Biomathematics & Statistics Scotland (BioSS), Edinburgh, UK.

Abstract

Understanding the mechanisms of gene transcriptional regulation through analysis of high-throughput postgenomic data is one of the central problems of computational systems biology. Various approaches have been proposed, but most of them fail to address at least one of the following objectives: (1) allow for the fact that transcription factors are potentially subject to posttranscriptional regulation; (2) allow for the fact that transcription factors cooperate as a functional complex in regulating gene expression; and (3) provide a model and a learning algorithm with manageable computational complexity. The objective of the present study is to propose and test a method that addresses these three issues. The model we employ is a mixture of factor analyzers, in which the latent variables correspond to different transcription factors, grouped into complexes or modules. We pursue inference in a Bayesian framework, using the Variational Bayesian Expectation Maximization (VBEM) algorithm for approximate inference of the posterior distributions of the model parameters and for estimation of a lower bound on the marginal likelihood for model selection. We have evaluated the performance of the proposed method on three criteria: activity profile reconstruction, gene clustering, and network inference.
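The model class named in the abstract can be made concrete with a small sampling sketch. The Python code below draws from a generic mixture of factor analyzers: each gene picks one analyzer through an indicator variable, draws standard normal latent factors, and yields an observed profile as loadings times factors, plus a displacement, plus diagonal Gaussian noise. It is an illustration under stated assumptions, not the authors' implementation: the function name `sample_mfa`, the argument layout, and all numerical values are hypothetical, and the sketch leaves out the TF binding profiles, the prior distributions, and the VBEM inference itself.

```python
import random


def sample_mfa(n_genes, loadings, means, proportions, noise_sd, seed=0):
    """Sample from a generic mixture of factor analyzers (hypothetical helper).

    loadings[s][d][k]: factor loading matrix of component s
    means[s][d]:       displacement (mean vector) of component s
    proportions[s]:    mixture proportions
    noise_sd[d]:       standard deviations of the diagonal Gaussian noise
    """
    rng = random.Random(seed)
    n_components = len(loadings)
    samples = []
    for _ in range(n_genes):
        # Indicator variable: select one factor analyzer for this gene.
        s = rng.choices(range(n_components), weights=proportions)[0]
        n_factors = len(loadings[s][0])
        # Latent factors with a standard normal prior.
        x = [rng.gauss(0.0, 1.0) for _ in range(n_factors)]
        # Observation = loadings @ factors + displacement + noise.
        y = [
            sum(l_dk * x_k for l_dk, x_k in zip(row, x))
            + means[s][d]
            + rng.gauss(0.0, noise_sd[d])
            for d, row in enumerate(loadings[s])
        ]
        samples.append((s, x, y))
    return samples


# Two analyzers, three observed dimensions, one latent factor (all made up).
loadings = [[[1.0], [0.5], [-1.0]], [[0.0], [2.0], [1.0]]]
means = [[0.0, 0.0, 0.0], [3.0, 3.0, 3.0]]
data = sample_mfa(5, loadings, means, [0.5, 0.5], [0.1, 0.1, 0.1])
```

Each entry of `data` is a tuple of the selected component, the latent factors, and the observed profile. Inference runs in the opposite direction, recovering components and factors from observed profiles, which is the part the abstract assigns to VBEM.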

Figures

**Figure 1**
**Transcriptional regulatory network**. (a) A transcriptional regulatory network in the form of a bipartite graph, in which a small number of transcription factors (TFs), represented by circles, regulate a large number of genes (represented by squares) by binding to their promoter regions. The black lines in the square boxes indicate gene expression profiles, that is, gene expression values measured under different experimental conditions or for different time points. The black lines in the circles represent TF activity profiles, that is, the concentrations of the TF subpopulation capable of DNA binding. Note that these TF activity profiles are usually unobserved owing to posttranslational modifications, and should hence be included as hidden or latent variables in the statistical model. (b) A more accurate representation of transcriptional regulation that allows for the cooperation of several TFs forming functional complexes; this complex formation is particularly common in higher eukaryotes.

**Figure 2**
**Bayesian mixture of factor analyzers (MFA) model applied to transcriptional regulation**. The figure shows a probabilistic independence graph of the Bayesian mixture of factor analyzers (MFA) model proposed in Section 3. Variables are represented by circles, and hyperparameters are shown as square boxes. Several components (factor analyzers), each with its own parameters, are used to model the expression profiles and TF binding profiles of the genes. The factor loadings have a zero-mean Gaussian prior distribution, whose precision hyperparameters are given a gamma distribution. The two analyzer displacements have Gaussian priors, each determined by its own hyperparameters. The indicator variables select one of the factor analyzers, and the associated latent variables or factors have normal prior distributions. The indicator variables are given a multinomial distribution, whose parameter vector, the so-called mixture proportions, has a conjugate Dirichlet prior. The Gaussian noise in the expression profiles and in the binding profiles is described by two diagonal covariance matrices, one for each data type. A dashed rectangle denotes a plate, that is, an iid repetition over the genes or the mixture components, respectively. The biological interpretation of the model quantities is as follows. One quantity represents the composition of each transcriptional module, that is, it indicates which TFs bind cooperatively to the promoters of the regulated genes. A second allows for perturbations that result, for example, from the temporary inaccessibility of certain binding sites or a variability of the binding affinities caused by external influences. A third is the background gene expression profile. A fourth represents the activity profile of each transcriptional module, which modulates the expression levels of the regulated genes. A fifth describes the gene-specific susceptibility to transcriptional regulation, that is, to what extent the expression of a given gene is influenced by the binding of a transcriptional module to its promoter. A complete description of the model can be found in Section 3.

**Figure 3**
**Simulated TF activity and expression profiles**. (a) Simulated activity profiles of six hypothetical TF modules. The other panels show simulated expression profiles of the genes regulated by the corresponding TF module (in the same row). From left to right, the three sets correspond to the three observational noise levels. The vertical axes show the activity levels (a) or relative log gene expression ratios (other panels), respectively, which are plotted against 40 hypothetical experiments or time points, represented by the horizontal axes.

**Figure 4**
**Simulated TF binding data**. The figure shows simulated TF binding data. The vertical axis in each subfigure represents the 90 genes involved in the regulatory network. From left to right: (a) The binary matrix of connectivity between the 6 TF modules (horizontal axis) and the 90 genes, where black entries represent connections. Each module is composed of one or several TFs. (b) The real binding matrix between TFs (horizontal axis) and genes (vertical axis), with black entries indicating binding. (c), (d) The noisy binding data sets used in the synthetic study, with darker entries indicating higher values. Details can be found in Section 4.1.

**Figure 5**
**In- and out-degree distributions of the simulated TF binding data**. (a) The arriving connectivity distribution (in-degree distribution). The number of genes regulated by a given number of TFs follows an exponential distribution in the in-degree. (b) The departing connectivity distribution (out-degree distribution). The number of TFs with a given out-degree follows a power-law distribution. Note that an exponential distribution appears as a straight line in a log-linear representation (a), whereas a distribution consistent with the power law appears as a straight line in a double logarithmic representation (b).
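The straight-line diagnostic mentioned in this caption can be checked numerically. The following Python sketch evaluates an exponential and a power-law distribution on a range of degrees and measures how well a least-squares line fits each one in log-linear and in double logarithmic coordinates. The rate `beta`, the exponent `gamma`, the degree range, and the helper `r_squared` are hypothetical choices for illustration and are not taken from the paper.

```python
import math


def r_squared(xs, ys):
    """Squared correlation, i.e. R^2 of the least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)


degrees = list(range(1, 21))
beta, gamma = 0.5, 2.0  # hypothetical rate and exponent, not the paper's values

exponential = [math.exp(-beta * k) for k in degrees]  # p(k) proportional to exp(-beta k)
power_law = [k ** -gamma for k in degrees]            # p(k) proportional to k^(-gamma)

log_k = [math.log(k) for k in degrees]
log_exp = [math.log(p) for p in exponential]
log_pow = [math.log(p) for p in power_law]

# Log-linear axes: log p(k) against k.
r2_exp_loglinear = r_squared(degrees, log_exp)  # exact straight line
r2_pow_loglinear = r_squared(degrees, log_pow)  # curved

# Double logarithmic axes: log p(k) against log k.
r2_exp_loglog = r_squared(log_k, log_exp)       # curved
r2_pow_loglog = r_squared(log_k, log_pow)       # exact straight line
```

Taking logarithms gives log p(k) = -beta k for the exponential case and log p(k) = -gamma log k for the power law, which is why each distribution is linear in exactly one of the two representations.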

**Figure 6**
**TF-gene interactions reconstructed with MFA and BFA from the synthetic data**. The figure shows TF-gene interactions predicted with the proposed MFA-VBEM approach, according to (24), and the BFA-Gibbs method, according to (A.32), using the noisy synthetic gene expression profiles of Figure 3, and the synthetic TF binding data sets shown in Figures 4(c), 4(d). (a), (c) correspond to the noisy TF binding data shown in Figure 4(c). (b), (d) correspond to the less noisy TF binding data, shown in Figure 4(d). (a), (b) show the TF-gene interaction strengths predicted with the MFA-VBEM approach. (c), (d) show the corresponding results obtained with the BFA-Gibbs method. The grey shading indicates the predicted strength of the interactions, with white corresponding to the absence of an interaction, and black corresponding to the presence of an interaction. The horizontal axis in each graph represents the 9 TFs that are involved in the regulation of the 90 genes; the latter are represented by the vertical axis of each graph. In each panel, from top to bottom, the three rows correspond to gene expression profile lengths of 10, 20 and 40. The three columns correspond to the three noise levels of the gene expression profiles. See Section 4.1 for further details.

**Figure 7**
**ROC curves of TF-gene regulatory network reconstruction for the synthetic data with MFA and BFA**. This figure shows various receiver operating characteristic (ROC) curves, where the numbers of predicted true positive interactions (vertical axis) are plotted against the numbers of false positive interactions (horizontal axis). Larger areas under the curve (AUC) indicate a better reconstruction accuracy. (a), (b) show the ROC curves obtained from TF binding data alone, without including gene expression profiles. (a) corresponds to the noisy TF binding data shown in Figure 4(c). (b) corresponds to the less noisy TF binding data, shown in Figure 4(d). (c), (d), each composed of 9 graphs, show the predictions obtained with MFA-VBEM from both noisy TF binding and gene expression profiles. (e), (f), also composed of 9 graphs each, show the results obtained with BFA-Gibbs on the same data. The arrangement of the graphs is the same as in Figure 6. The results suggest that MFA-VBEM systematically outperforms BFA-Gibbs. They also suggest that for noisy TF binding data (c), (e), the inclusion of gene expression profiles and the application of MFA-VBEM lead to an improvement in the TF-gene regulatory network reconstruction.
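ROC curves of the kind described in this caption, with counts of true positives plotted against counts of false positives, can be computed from any ranking of candidate TF-gene interactions. The Python sketch below is a generic illustration rather than the evaluation code used in the study: the function names `roc_counts` and `auc`, the example scores, and the example labels are all hypothetical, and tied scores are not given any special treatment.

```python
def roc_counts(scores, truth):
    """Cumulative (false positive, true positive) counts as the score
    threshold is lowered over the ranked candidate interactions."""
    ranked = sorted(zip(scores, truth), key=lambda pair: -pair[0])
    fp = tp = 0
    curve = [(0, 0)]
    for _, is_true in ranked:
        if is_true:
            tp += 1
        else:
            fp += 1
        curve.append((fp, tp))
    return curve


def auc(curve):
    """Area under the curve, normalised to [0, 1] by the total counts."""
    total_fp, total_tp = curve[-1]
    area = 0.0
    for (fp0, tp0), (fp1, tp1) in zip(curve, curve[1:]):
        area += (fp1 - fp0) * (tp0 + tp1) / 2  # trapezoid rule
    return area / (total_fp * total_tp)


# Hypothetical predicted interaction strengths and true interaction labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.2]
truth = [1, 1, 0, 1, 0]
curve = roc_counts(scores, truth)
area = auc(curve)
```

A ranking that places every true interaction above every false one gives a normalised area of 1, while a random ranking gives an area of about 0.5 on average.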

**Figure 8**
**TF regulatory network reconstruction for yeast**. Receiver operating characteristic (ROC) curves obtained for *S. cerevisiae* with three different methods: (1) solid line: the proposed MFA-VBEM method, based on the work of Beal [23], and extended as described in Section 3; (2) dashed line: the Bayesian FA model with Gibbs sampling, as proposed in Sabatti and James [16]; and (3) dotted line: maximum likelihood FA with the EM algorithm of Ghahramani and Hinton [24] and a subsequent varimax rotation [39] of the loading matrix towards maximum sparsity, as proposed in Pournara and Wernisch [18]. (a) The performance on a noisy training set, where 10% false positive interactions had been randomly added to the TF binding profiles from the literature [38], while the computation of the ROC curves was based on the unperturbed literature data (network curation task). (b) The out-of-sample performance on an independent test set containing genes not used for training (network prediction). Note that in the latter case the Gibbs sampling approach was run twice, with two different prior matrices: a random prior, where for each gene 11 randomly chosen elements in the matrix were nonzero (dashed line); and a "good" prior, where the nonzero elements were chosen according to Teixeira et al. [38] subject to the maximum connectivity constraint described in the text (dash-dotted line).

**Figure 9**
**Out-of-sample TF regulatory network reconstruction for yeast**. Receiver operating characteristic (ROC) curves obtained for *S. cerevisiae* with three different methods: (1) solid line: the proposed MFA-VBEM method, based on the work of Beal [23], and extended as described in Section 3; (2) dashed line: the Bayesian FA model with Gibbs sampling, as proposed in Sabatti and James [16]; and (3) dotted line: maximum likelihood FA with the EM algorithm of Ghahramani and Hinton [24] and a subsequent varimax rotation [39] of the loading matrix towards maximum sparsity, as proposed in Pournara and Wernisch [18]. The subfigures show the out-of-sample performance on an independent test set containing genes not used for training (network prediction). From left to right, the models were trained using 40%, 60% and 80% of data.

**Figure 10**
**Composition of one of the TF complexes in yeast**. The figure shows the composition of one of the TF modules found with MFA-VBEM for the yeast data, plotted on the vertical axis against the 12 TFs involved in the study. As explained in the caption of Figure 2, the plotted quantity indicates the composition of a TF module. It is clearly seen that this TF module is dominated by two TFs, Ste12 and Tec1, and thereby corresponds to a well-established module reported in the literature [51].

References

1. Bussemaker HJ, Li H, Siggia ED. Regulatory element detection using correlation with expression. Nature Genetics. 2001;27(2):167–171. doi: 10.1038/84792.
2. Conlon EM, Liu XS, Lieb JD, Liu JS. Integrating regulatory motif discovery and genome-wide expression analysis. Proceedings of the National Academy of Sciences of the United States of America. 2003;100(6):3339–3344. doi: 10.1073/pnas.0630591100.
3. Beer MA, Tavazoie S. Predicting gene expression from sequence. Cell. 2004;117(2):185–198. doi: 10.1016/S0092-8674(04)00304-6.
4. Phuong TM, Lee D, Lee KH. Regression trees for regulatory element identification. Bioinformatics. 2004;20(5):750–757. doi: 10.1093/bioinformatics/btg480.
5. Segal E, Yelensky R, Koller D. Genome-wide discovery of transcriptional modules from DNA sequence and gene expression. Bioinformatics. 2003;19(supplement 1):i273–i282. doi: 10.1093/bioinformatics/btg1038.
