Nucleic Acids Res. 2009 Apr;37(5):1566-79.
doi: 10.1093/nar/gkn1064. Epub 2009 Jan 16.

Allegro: analyzing expression and sequence in concert to discover regulatory programs

Yonit Halperin et al. Nucleic Acids Res. 2009 Apr.

Abstract

A major goal of systems biology is the characterization of transcription factors and microRNAs (miRNAs) and the transcriptional programs they regulate. We present Allegro, a method for de novo discovery of cis-regulatory transcriptional programs through joint analysis of genome-wide expression data and promoter or 3' UTR sequences. The algorithm uses a novel log-likelihood-based, non-parametric model to describe the expression pattern shared by a group of co-regulated genes. We show that Allegro is more accurate and sensitive than existing techniques, and can simultaneously analyze multiple expression datasets with more than 100 conditions. We apply Allegro to datasets from several species and report on the transcriptional modules it uncovers. Our analysis reveals a novel motif over-represented in the promoters of genes highly expressed in murine oocytes, and several new motifs related to fly development. Finally, using stem-cell expression profiles, we identify three miRNA families with pivotal roles in human embryogenesis.

Figures

Figure 1.
Overview of the Allegro computational approach. Given a genome-wide expression matrix and cis-regulatory sequences (promoters or 3′ UTRs), Allegro executes efficient algorithms and statistical analyses to search for transcriptional modules. A transcriptional module is a set of genes sharing a sequence motif, modeled using a PWM, and a common expression profile described using a novel model called CWM. The CWM is analogous to the PWM: it assigns a weight to each discrete expression level in each of the experimental conditions. Allegro uses a multi-phase motif enumeration engine to generate candidate motifs. For each motif, it applies a cross-validation-like procedure to construct a CWM (Supplementary Figure 2), such that there is a significantly large overlap between the targets of the motif (the set of genes whose cis-regulatory sequence has an occurrence of the PWM, left arrows at the top) and the targets of the CWM (the genes whose expression levels match the CWM, right arrows). The statistical significance of this overlap is evaluated using one of two enrichment scores: the HG score or the binned enrichment score, which accounts for biases in the length and GC-content of the cis-regulatory sequences. The scores obtained by the motifs and their CWMs are iteratively modified to improve the models and eventually converge to high-scoring transcriptional modules.
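The pieces named in the Figure 1 caption (a CWM that assigns a weight to each discrete expression level in each condition, and an enrichment score for the overlap between motif targets and CWM targets) can be illustrated with a minimal sketch. This is not the Allegro implementation: the function names, the three expression levels (-1, 0, 1), the pseudocount, the score threshold of 0, and the toy profiles are all assumptions made for illustration, and reading "HG score" as a hypergeometric tail probability is likewise an assumption. The cross-validation-like construction and the iterative refinement described in the caption are omitted.

```python
import math
from collections import Counter

def build_cwm(profiles, targets, levels=(-1, 0, 1), pseudocount=1.0):
    """Return cwm[c][e] = log(P(level e | targets, cond c) / P(level e | all genes, cond c)).

    profiles: dict gene -> tuple of discrete expression levels, one per condition.
    targets:  set of genes used as the foreground (here, the motif targets).
    """
    n_conditions = len(next(iter(profiles.values())))
    fg_total = len(targets) + pseudocount * len(levels)
    bg_total = len(profiles) + pseudocount * len(levels)
    cwm = []
    for c in range(n_conditions):
        fg = Counter(profiles[g][c] for g in targets)
        bg = Counter(p[c] for p in profiles.values())
        cwm.append({
            e: math.log(((fg[e] + pseudocount) / fg_total)
                        / ((bg[e] + pseudocount) / bg_total))
            for e in levels
        })
    return cwm

def llr_score(cwm, profile):
    """Log-likelihood ratio of one gene's profile: sum of per-condition weights."""
    return sum(cwm[c][e] for c, e in enumerate(profile))

def hg_tail(n_genes, n_motif_targets, n_cwm_targets, overlap):
    """P(X >= overlap) when n_cwm_targets genes are drawn without replacement
    from n_genes genes, of which n_motif_targets carry the motif."""
    upper = min(n_motif_targets, n_cwm_targets)
    favourable = sum(
        math.comb(n_motif_targets, i)
        * math.comb(n_genes - n_motif_targets, n_cwm_targets - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / math.comb(n_genes, n_cwm_targets)

# Toy data (hypothetical): six genes, three conditions, levels in {-1, 0, 1}.
profiles = {
    "g1": (1, 1, -1), "g2": (1, 1, -1), "g3": (1, 0, -1),
    "g4": (0, 0, 0), "g5": (-1, 0, 1), "g6": (0, -1, 0),
}
motif_targets = {"g1", "g2", "g3"}  # genes assumed to have a motif hit

cwm = build_cwm(profiles, motif_targets)
scores = {g: llr_score(cwm, p) for g, p in profiles.items()}
cwm_targets = {g for g, s in scores.items() if s > 0}  # threshold 0 is an assumption
overlap = len(motif_targets & cwm_targets)
p_value = hg_tail(len(profiles), len(motif_targets), len(cwm_targets), overlap)
print(sorted(cwm_targets), overlap, p_value)
```

Because the CWM here is built from and scored on the same genes, the overlap is optimistic; the caption's cross-validation-like procedure presumably addresses this kind of circularity.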
Figure 2.
Results of Allegro on the human cell cycle dataset (15). (A) Screenshot of Allegro. The left panel presents the input parameters: organism, expression data file, scores, etc. The top-scoring motifs discovered by Allegro are shown in the output panel on the right. Additional information is displayed for each motif, such as the average expression profile of the CWM targets that contain a hit of the motif, statistics on the number of hits and their locations, similar binding patterns from Transfac or miRBase, and more. Here, the three top-scoring motifs reported by Allegro represent the binding-site (BS) patterns of key regulators of the human cell cycle: E2F, CHR (whose binding TF is unknown), and NF-Y (not shown). (B) Expression profiles of the five CWM targets with the highest LLR score for each of the three motifs found by Allegro. High and low expression values relative to time 0 are colored in red and green, respectively. The purple bars represent S phase and the blue vertical lines indicate mitoses, as reported in (15). In agreement with biological knowledge and previous computational analyses (25–28,50), E2F induces genes mainly in the G1/S phase, whereas CHR and NF-Y are highly specific to the G2 and G2/M phases.
Figure 3.
Results of Allegro on the yeast HOG pathway expression dataset (14). Allegro finds the motifs PAC, RRPE, STRE and the binding patterns of Rap1, MBF, Ste12 and Sko1. Each motif is presented together with the average expression profile (±1 SD) of its CWM targets which contain a hit for the motif in their promoter. The titles above the expression series indicate the yeast strain the expression was sampled from: WT, and knockout strains [indicated by the name(s) of the gene(s) that were knocked out]. The concentrations of KCl and sorbitol are given in molar units.
Figure 4.
The top three 3′ UTR motifs identified in the stem cells dataset (19). On the left, the motif p-value and logo are presented along with the first 11 bases (starting from the 5′ base of the mature microRNA) of miRNAs with a seed that matches the reverse complement of the motif. For the first motif, only one miRNA from each of the four matching miRNA families is presented. For each motif, the graph on the right shows the average expression values (in log2 scale) of the corresponding CWM targets that contain a hit for the motif. Each bar represents the average expression level in one of the cell types (ESCs/NSCs/MSCs—embryonic/neural/mesenchymal stem cells; ‘Undiff.’—Undifferentiated, ‘diff.’—differentiated, ‘Terato.’—Teratocarcinoma; see also Supplementary Table V; the full expression profile of targets of motif 1 in all 124 samples is shown in Supplementary Figure 9). The graph also shows the expression levels (in log2 scale) of the matching miRNA(s): mir-302 for motif 1 (average expression over all mir-302 family members), mir-124 for motif 2 and mir-9 for motif 3. miRNA expression levels are presented only for the cell types profiled in (63). Evidently, the expression profiles of the motif targets and those of the matching miRNAs are anti-correlated, increasing our confidence that the discovered motifs represent miRNAs that are active in the relevant cells.
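The matching rule in the Figure 4 caption (a 3′ UTR motif is linked to a miRNA when it matches the reverse complement of the miRNA seed) can be sketched as follows. The seed is taken here as positions 2 to 8 of the mature miRNA, which is a common convention but an assumption with respect to this paper; the function names, the substring-based match, and the 11-base input sequence are hypothetical and chosen only to illustrate the computation.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(dna):
    """Reverse complement of a DNA string over the alphabet A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))

def seed_site(mirna, start=2, end=8):
    """DNA site complementary to the miRNA seed.

    mirna: mature miRNA written 5' to 3' (RNA alphabet, U allowed).
    start, end: 1-based inclusive seed positions (2 to 8 is an assumption).
    """
    dna = mirna.upper().replace("U", "T")
    return reverse_complement(dna[start - 1:end])

def motif_matches_seed(motif, mirna):
    """True if the motif consensus and the seed site contain one another."""
    site = seed_site(mirna)
    return motif in site or site in motif

# Hypothetical input: the first 11 bases of a mature miRNA, 5' to 3'.
toy_mirna = "UAAGGCACGCG"
print(seed_site(toy_mirna))
print(motif_matches_seed("TGCCTT", toy_mirna))
```

A real comparison would score a PWM against the seed site rather than test exact substrings; the exact match is used here only to keep the sketch short.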

References

    1. Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell. 1998;9:3273–3297.
    2. Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM. Systematic determination of genetic network architecture. Nat. Genet. 1999;22:281–285.
    3. Wyrick JJ, Young RA. Deciphering gene expression regulatory networks. Curr. Opin. Genet. Dev. 2002;12:130–136.
    4. Jiang D, Tang C, Zhang A. Cluster analysis for gene expression data: a survey. IEEE Trans. Knowl. Data Eng. 2004;16:1370–1386.
    5. Holmes I, Bruno WJ. Finding regulatory elements using joint likelihoods for sequence and expression profile data. Proc. Int. Conf. Intell. Syst. Mol. Biol. 2000;8:202–210.
