Genome Res. 2014 Jan;24(1):154-66.
doi: 10.1101/gr.164327.113. Epub 2013 Oct 29.

A unified model for yeast transcript definition

Carl G de Boer et al. Genome Res. 2014 Jan.

Abstract

Identifying genes in the genomic context is central to a cell's ability to interpret the genome. Yet, in general, the signals used to define eukaryotic genes are poorly described. Here, we derived simple classifiers that identify where transcription will initiate and terminate using nucleic acid sequence features detectable by the yeast cell, which we integrate into a Unified Model (UM) that models transcription as a whole. The cis-elements that denote where transcription initiates function primarily through nucleosome depletion, and, using a synthetic promoter system, we show that most of these elements are sufficient to initiate transcription in vivo. Hrp1 binding sites are the major characteristic of terminators; these binding sites are often clustered in terminator regions and can terminate transcription bidirectionally. The UM predicts global transcript structure by modeling transcription of the genome using a hidden Markov model whose emissions are the outputs of the initiation and termination classifiers. We validated the novel predictions of the UM with available RNA-seq data and tested it further by directly comparing the transcript structure predicted by the model to the transcription generated by the cell for synthetic DNA segments of random design. We show that the UM identifies transcription start sites more accurately than the initiation classifier alone, indicating that the relative arrangement of promoter and terminator elements influences their function. Our model presents a concrete description of how the cell defines transcript units, explains the existence of nongenic transcripts, and provides insight into genome evolution.
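The idea of decoding transcript structure from classifier outputs with a hidden Markov model can be illustrated with a toy example. The sketch below is not the authors' implementation. It assumes four hypothetical states (IG, TSS, TX, CPA), reduces the classifier outputs to three discrete symbols ("init", "term", "none"), and uses invented probabilities. It decodes with the Viterbi algorithm, a standard choice that the abstract does not specify.

```python
import math

def safe_log(p):
    # log(0) is treated as negative infinity so forbidden moves never win
    return math.log(p) if p > 0 else float("-inf")

def viterbi(obs, states, start, trans, emit):
    """Return the most probable state path for a sequence of symbols."""
    # scores[s] = best log-probability of any path ending in state s
    scores = {s: safe_log(start.get(s, 0.0)) + safe_log(emit[s][obs[0]])
              for s in states}
    backpointers = []
    for symbol in obs[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            best_prev = max(
                states,
                key=lambda p: scores[p] + safe_log(trans[p].get(s, 0.0)),
            )
            new_scores[s] = (scores[best_prev]
                             + safe_log(trans[best_prev].get(s, 0.0))
                             + safe_log(emit[s][symbol]))
            pointers[s] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state
    path = [max(states, key=lambda s: scores[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

# All names and numbers below are illustrative assumptions.
STATES = ["IG", "TSS", "TX", "CPA"]
START = {"IG": 1.0}
TRANS = {
    "IG": {"IG": 0.9, "TSS": 0.1},
    "TSS": {"TX": 1.0},
    "TX": {"TX": 0.9, "CPA": 0.1},
    "CPA": {"IG": 1.0},
}
EMIT = {
    "IG": {"none": 0.90, "init": 0.05, "term": 0.05},
    "TSS": {"none": 0.05, "init": 0.90, "term": 0.05},
    "TX": {"none": 0.90, "init": 0.05, "term": 0.05},
    "CPA": {"none": 0.05, "init": 0.05, "term": 0.90},
}

calls = ["none", "none", "init", "none", "none", "term", "none"]
decoded = viterbi(calls, STATES, START, TRANS, EMIT)
print(decoded)
```

In this toy setting, a termination call is interpreted as a CPA site only when the path is already inside a transcript, which is one way the arrangement of initiation and termination signals can affect how each is read.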

Figures

Figure 1.
Design, refinement, and performance of the classifiers. (A) Classifier pipeline. Training and test examples were generated by calculating the relevant features (rounded boxes) using the DNA sequence of the example. The features were calculated over the bins shown in the colored boxes. At the bottom of the colored boxes, the components of the minimal feature sets are shown. Feature colors represent the feature type, including transcription factors (TF), general transcription factors (GTF), base content (%NT), or RNA-binding proteins (RBP). (B) ROC curves representing initiation and termination classifiers with either all features tested or the minimal feature sets, derived from the test data. The line y = x represents the curve expected by random classification.
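For readers unfamiliar with ROC curves, the following minimal sketch shows how such a curve and its area can be computed from classifier scores and true labels. The function names and the example scores are hypothetical and are not taken from the paper.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points,
    sweeping the threshold from strictest to loosest."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last example sharing this score,
        # so that tied scores are handled as one threshold step
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Invented example: 1 = positive example, 0 = negative example
example_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
example_labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(example_scores, example_labels)
print(auc(curve))
```

The diagonal y = x, which corresponds to random classification, has an area of 0.5 under this definition.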
Figure 2.
Properties of terminators. (A) Base content surrounding the 3457 optimal nonoverlapping Hrp1 binding sites in terminator regions (≤150 bp upstream of CPA site). Colors indicate the base at the corresponding position, from 25 bp upstream of to 25 bp downstream from the motif match. (B) Alignment of RNA-seq reads corresponding to poly(A) sites (Nagalakshmi et al. 2008) on both DNA strands for convergent intergenic regions (Conv.) and elsewhere in the genome (Else.). Data are aligned to sense poly(A) sites and include all poly(A) sites in the genome with at least two reads. The data represent the average read count of all aligned loci and are smoothed over a 5-bp window.
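The smoothing mentioned in this caption can be sketched as a moving average. The caption does not say how the 5-bp window is positioned or how the ends are treated, so the centered window and the truncation at the sequence ends below are assumptions, and the function name is hypothetical.

```python
def smooth(counts, window=5):
    """Centered moving average; the window is truncated at both ends."""
    half = window // 2
    smoothed = []
    for i in range(len(counts)):
        lo = max(0, i - half)
        hi = min(len(counts), i + half + 1)
        smoothed.append(sum(counts[lo:hi]) / (hi - lo))
    return smoothed

# Invented read counts with a single sharp peak
print(smooth([0, 0, 5, 0, 0]))
```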
Figure 3.
Construction and analysis of the combinatorial promoter library. (A) Synthetic double-stranded promoter fragments with complementary overhangs were ligated together to yield full-length promoters, which were then cloned into a GFP expression vector. We used flow cytometry and sequencing to measure the expression level of each promoter (see Methods). (B) Point-density scatter plot showing the correlation between the initiation score and the expression level (as described in Methods, log-scale). Darkness corresponds to point density. Horizontal and vertical lines indicate the expression level and initiation score thresholds for considering sequences “expressed” and a “predicted promoter,” respectively. (C–G) Identical to B but divided into promoters containing (C) Reb1, (D) Abf1, (E) Rap1, and (F) Rsc3 binding sites in the −150:−80 bin, and (G) the TATA box in the −80:−50 bin. (H–J) Point-density scatter plots showing the expression level of promoters that are identical except for the presence or absence of functional (H) Rap1, (I) Rsc3, or (J) Spt15 (TBP) binding sites. The line y = x marks the point at which expression is identical between the two promoters, regardless of the binding site's presence. The other GRFs (Abf1 and Reb1) are similar to Rap1 (H).
Figure 4.
A genome-scale yeast transcript model. (A) The structure of the Unified Model HMM. Circles represent states and arrows represent interstate transitions. Inside state circles, the number of bases the model expects to remain in each state is shown in parentheses. Transition probabilities, as a percent of outgoing transitions, are shown on transition arrows. Very infrequent transitions (probability < 1%) are not shown. (IG) Intergenic state. (B) Genome Browser display illustrating the predictions of the models at the GAL1-10 locus of chromosome 2. The tracks on the top half represent data for the forward strand of DNA, with the reverse strand on the lower half. From center: blue bars represent genes, with thinner bars representing UTRs, and the gray bar represents a Ty element. Black tracks represent RNA-seq read density on a log scale (Levin et al. 2010). The Unified Model's predictions are shown with dark green, blue, and red on a single track representing the probability of being in each of the states, where the probabilities are shown stacked. The light green and red tracks on the outer edge represent the scores for the initiation and termination classifiers, respectively. Initiation peaks corresponding to the true TSS and other potential TSSs for the CHS3 gene are as indicated, and some examples of predicted nongenic transcripts that are supported by RNA-seq are shown boxed.
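As background on the expected number of bases spent in a state: in a standard HMM, a state with self-transition probability p is occupied for a geometrically distributed number of positions. This is a general property of HMMs rather than a detail taken from the paper, and the symbols p and d are introduced here for illustration only.

```latex
\mathbb{E}[d] \;=\; \sum_{d=1}^{\infty} d \, p^{\,d-1} (1 - p) \;=\; \frac{1}{1 - p}
```

For example, under this assumption a state with p = 0.99 would be expected to persist for 100 bases.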
Figure 5.
Performance of the UM. (A) ROC curves illustrating how well the UM predicts TSSs, transcripts, and CPA sites, when classifying the positive and negative examples for the initiation and termination classifiers, as well as ORFs/transcripts and nontranscript bases. (B) ROC curve comparing the ability of both the UM and initiation classifier to distinguish between TSSs and bases that are part of nondubious ORFs. The line y = x represents the curve expected by random classification.
Figure 6.
Predicted transcript structure and measured expression of two of the four randomly generated 6-kb fragments. Tracks as in Figure 4, except that “Expression” was measured using custom Agilent tiling arrays. The construct names are indicated in the center, and discernible transcripts are labeled along the tiling array data. A1B1 and A2B2 constructs are shown in Supplemental Figure 8 and show very similar results.
Figure 7.
Gene definition model. (A) In the absence of transcription, the DNA forms nucleosomes except where prevented by bound TFs (such as the GRFs) or by the DNA structure. (B) Transcription begins indiscriminately from nucleosome-free regions in proportion to the efficiency of pre-initiation complex (PIC) formation. Promoters compete with one another in cis through the act of transcription. (C) An equilibrium is reached where some promoters are active and others are repressed. Successful cleavage and polyadenylation reinforces the promoter choice. (D) If a nucleosome-free region is destroyed (for instance, through loss of GRF binding), it is no longer competent for initiating transcription. Downstream promoters are then de-repressed, become active, and a new equilibrium is reached.
