IEEE/ACM Trans Comput Biol Bioinform. 2017 Sep-Oct;14(5):1070-1081.
doi: 10.1109/TCBB.2016.2520919. Epub 2016 Jan 26.

An Annotation Agnostic Algorithm for Detecting Nascent RNA Transcripts in GRO-Seq

Joseph G Azofeifa et al. IEEE/ACM Trans Comput Biol Bioinform. 2017 Sep-Oct.

Abstract

We present a fast and simple algorithm to detect nascent RNA transcription in global nuclear run-on sequencing (GRO-seq). GRO-seq is a relatively new protocol that captures nascent transcripts from actively engaged polymerase, providing a direct read-out on bona fide transcription. Most traditional assays, such as RNA-seq, measure steady state RNA levels which are affected by transcription, post-transcriptional processing, and RNA stability. GRO-seq data, however, presents unique analysis challenges that are only beginning to be addressed. Here, we describe a new algorithm, Fast Read Stitcher (FStitch), that takes advantage of two popular machine-learning techniques, hidden Markov models and logistic regression, to classify which regions of the genome are transcribed. Given a small user-defined training set, our algorithm is accurate, robust to varying read depth, annotation agnostic, and fast. Analysis of GRO-seq data without a priori need for annotation uncovers surprising new insights into several aspects of the transcription process.

Figures

Fig. 1. A schematic showing how contig length and coverage statistics discriminate active from inactive nascent transcription
Regions of active transcription contain many long contigs (positive length, not drawn to scale) with significant read coverage (labeled in blue) interspersed with short regions of no coverage. Coverage statistics define the mean, median, mode and variance of reads (black bars) across a contig (see Table S1). In segments with no reads, a gap (labeled in green) is defined by a negative length value and all coverage statistics are set to zero. For our algorithm, reads (grey bars) are represented by only their 5′ position (black points). A contig is therefore a continuous region in which every base has at least one read’s 5′ end at that position. Consequently, small gaps between contigs have a high probability of being in an active call.
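The segmentation described in the Fig. 1 caption can be sketched in a few lines. This is a minimal illustration and not the FStitch implementation: the function name `segment_contigs`, the tuple layout, and the inclusive coordinates are assumptions made for this example. The only conventions taken from the caption are that reads are reduced to their 5′ positions, that contigs carry a positive length, and that gaps carry a negative length.

```python
from typing import List, Tuple

# Hypothetical helper (not part of FStitch): split a strand into contigs and gaps.
def segment_contigs(five_prime_positions: List[int]) -> List[Tuple[int, int, int]]:
    """Return (start, end, signed_length) segments with inclusive coordinates.

    A contig is a run of consecutive bases that each hold at least one
    read 5' end; it gets a positive length. The uncovered stretch between
    two contigs is a gap and gets a negative length.
    """
    covered = sorted(set(five_prime_positions))  # duplicates do not extend a contig
    segments: List[Tuple[int, int, int]] = []
    if not covered:
        return segments

    start = prev = covered[0]
    for pos in covered[1:]:
        if pos == prev + 1:
            prev = pos  # still inside the current contig
            continue
        # Close the current contig, then record the gap up to the next covered base.
        segments.append((start, prev, prev - start + 1))
        segments.append((prev + 1, pos - 1, -(pos - prev - 1)))
        start = prev = pos
    segments.append((start, prev, prev - start + 1))
    return segments


example = segment_contigs([10, 11, 12, 20, 21, 21, 30])
```

In this toy input, bases 10-12 and 20-21 form contigs, base 30 forms a single-base contig, and the uncovered stretches between them are reported as gaps with negative lengths.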
Fig. 2. Read coverage features are not linearly separable
Points colored green represent training examples labeled active and those colored red indicate training examples labeled inactive. The blue shading provides a contour plot of the active state probability given the feature’s average read coverage (x3, y-axis) and the gap length between adjacent contigs (x1, x-axis in log nucleotides). (A) uses logistic regression with a linear kernel function (i.e. d = 1 in equation 3), whereas (B) uses a second-order polynomial kernel function (i.e. d = 2 in equation 3).
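As a rough illustration of the difference between panels (A) and (B), the sketch below evaluates a polynomial kernel for d = 1 and d = 2. Equation 3 is not reproduced on this page, so the standard form k(x, z) = (x · z + c)^d is an assumption here, as are the function names `poly_kernel` and `degree2_features`. The point is only that d = 2 with c = 0 corresponds to an inner product over squared and cross terms, which is what allows a curved decision boundary.

```python
import math
from typing import Sequence, List

# Assumed standard polynomial kernel; the paper's equation 3 may differ in detail.
def poly_kernel(x: Sequence[float], z: Sequence[float], d: int = 1, c: float = 0.0) -> float:
    dot = sum(a * b for a, b in zip(x, z))
    return (dot + c) ** d

# Explicit degree-2 feature map for two features (valid for c = 0 only).
def degree2_features(x: Sequence[float]) -> List[float]:
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]


x, z = [1.0, 2.0], [3.0, 4.0]
linear = poly_kernel(x, z, d=1)     # plain dot product
quadratic = poly_kernel(x, z, d=2)  # squared dot product
explicit = sum(a * b for a, b in zip(degree2_features(x), degree2_features(z)))
```

With d = 1 the kernel reduces to the ordinary dot product, so the classifier is linear in the original features; with d = 2 the same computation implicitly uses the quadratic features produced by `degree2_features`.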
Fig. 3. FStitch output at BRPF3
An IGV snapshot showing a subregion in chromosome 6 around BRPF3. The first track shows typical GRO-seq data from the HCT116 dataset, with the positive and negative strand in blue and red, respectively. RefSeq annotations are shown next. FStitch output is below for each strand with green indicating areas of inactive transcriptional activity, blue representing areas of active transcription on the positive strand and red on the negative strand. The scores associated with each classification via the Logistic Regression and Viterbi-provided Markov state sequence are also displayed. Finally, bidirectional predictions are provided at the bottom with a score via the estimated Normal Distribution confidence interval.
Fig. 4. Examination of the impact of distinct approaches to identify regions of interest
MA-plots were generated by DESeq for each of three distinct methods of determining differential transcription using FStitch active calls. The projection method has two variations, one starts with (A) the DMSO active calls and the other with (B) Nutlin active calls. The other methods are (C) joint and (D) the merge method. See text for details on each method.
Fig. 5. FStitch requires little training data and is robust to low levels of GRO-seq read coverage
(A) Classification accuracy utilizing successively decreasing amounts of training data to learn feature vector weights, for the polynomial (d = 2 and c = 0; blue and teal) and linear (d = 1 and c = 0; green and red) kernel. (B) Classification accuracy with successively less sequencing depth (dataset size). In this case, we trained on 5% of all available chromosome 1 labels and tested on 50 different subsamples of the curated dataset. TP = true positive rate and FN = false negative rate.
Fig. 6. Correlation of GRO-seq transcript calls with Pol II ChIP-seq
Pol-II ChIP-seq read density was collected in regions labeled as bidirectional (blue), active (green) or inactive (red) by either FStitch (on left) or Vespucci (on right). Log fold-enrichment is relative to average Pol-II ChIP-seq read density. Statistical significance is assessed via the Kolmogorov-Smirnov test (significance bars colored by p-value). Error bars indicate one standard deviation away from the mean.
Fig. 7. Active Call Characterization
FStitch active calls on HCT116 DMSO are divided into classes based on overlap with genomic annotations. Active calls are classified as unannotated if they have no overlap with previous annotations on either strand. FStitch called 37,591 active regions.
Fig. 8. Average Read Coverage of FStitch active calls
FStitch active calls on the positive strand that completely contain a RefSeq annotation were used to calculate the average behavior. Blue and red represent positive and negative strand coverage, respectively. For each active region, the length was divided into 100 uniformly sized bins and the read coverage was averaged within each bin. The average annotated 3′ end is marked by the line, and transcription beyond the annotation is shaded. Here, we require an FStitch call to completely overlap a RefSeq annotation and the RefSeq annotation to overlap at least 75% of the FStitch call.
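The length normalization described in the Fig. 8 caption can be sketched as follows. This is an illustrative reading of the caption and not the authors' code: the function names `binned_mean_coverage` and `average_profile`, the use of plain Python lists for per-base coverage, and the rule for placing bin boundaries are all assumptions.

```python
from typing import List, Sequence

# Hypothetical helper: rescale one region of any length to n_bins mean values.
def binned_mean_coverage(coverage: Sequence[float], n_bins: int = 100) -> List[float]:
    n = len(coverage)
    if n < n_bins:
        raise ValueError("region is shorter than the number of bins")
    means = []
    for b in range(n_bins):
        lo = b * n // n_bins        # integer boundaries keep bins near-equal in size
        hi = (b + 1) * n // n_bins  # hi > lo is guaranteed because n >= n_bins
        window = coverage[lo:hi]
        means.append(sum(window) / len(window))
    return means

# Hypothetical helper: average the binned profiles of many regions, bin by bin.
def average_profile(regions: Sequence[Sequence[float]], n_bins: int = 100) -> List[float]:
    profiles = [binned_mean_coverage(r, n_bins) for r in regions]
    return [sum(column) / len(profiles) for column in zip(*profiles)]


toy = average_profile([[1, 1, 3, 3], [3, 3, 1, 1]], n_bins=2)
```

Because every region is reduced to the same number of bins, regions of very different lengths contribute equally to the averaged profile.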
Fig. 9. Histograms comparing the active region calls of FStitch to RefSeq annotations
We plot the distance between the end of an active call and the nearest RefSeq annotation for (A) 5′-ends; (B) 3′-ends. Colors red, blue and green are Hah et al., Vespucci (grid search parameters) and FStitch active calls, respectively. Histograms are probability normalized.
Fig. 10. Bidirectional predictions and active FStitch calls connected by a ChIA-PET read pair show correlated GRO-seq transcription
The GRO-seq transcription level of ChIA-PET read pairs that overlap a bidirectional call and an active call on either end are plotted, demonstrating a strong correlation (ρ = 0.8301) in transcription (as measured by GRO-seq). Points are colored according to genomic distance (kb) between bidirectional prediction and active call.
Fig. 11. The overlap between FStitch and Allen et al. at RefSeq genes
The gene sets called as differentially transcribed by the two methods, Allen et al. (red) and FStitch (blue), are compared at gene annotations (black numbers). The histogram on the right shows the percentage of each annotated region that is called as differentially transcribed by FStitch. When the overlap to a gene is required to be > 75% (green box), 129 genes are no longer called as differentially transcribed by FStitch, including 45 genes that were previously called by both methods.
Fig. 12. Differential Transcription at PVPR4
An IGV snapshot showing PVPR4, a negative strand gene where a small portion of the gene is differentially transcribed. The region of differential transcription (black bars) overlaps both FStitch bidirectional calls (blue bars) and p53 binding sites (green bars), indicating this may be an intragenic enhancer. The tracks, in order, are: histograms of the GRO-seq signal observed in DMSO and Nutlin, respectively (positive strand: blue; negative strand: red); RefSeq annotation for PVPR4; FStitch bidirectional calls in both DMSO and Nutlin, respectively (blue bars); FStitch differential transcription calls (black bars: top is negative strand, bottom is positive strand); location of p53 binding events (in green).
Fig. 13. Overlap of differential transcription and p53 marks
FStitch calls were grouped by significance of differential transcription (significant: DESeq adj. p-value < 0.1) and by overlap with a RefSeq annotation. From top to bottom, there are 64,899 regions without differential transcription (insignificant) and without overlapping annotation (unannotated); 782 significant-unannotated; 23,986 insignificant-annotated; and 262 significant-annotated, respectively. p53 binding site (ChIP) overlap and p53 motif presence are assessed as described in the text.
Fig. 14. Overlap of differential transcription with enhancer marks
FStitch calls that do not overlap any RefSeq annotation were grouped by differential transcription as determined by DESeq (significant: adj. p-value < 0.1). Regions were assessed for overlap with enhancer marks: H3K27ac, H3K4me1, and DNase I hypersensitivity [26], [28].

References

    1. Core LJ, Waterfall JJ, Lis JT. Nascent RNA sequencing reveals widespread pausing and divergent initiation at human promoters. Science. 2008 Dec;322(5909):1845–1848.
    2. Kapranov P, Willingham AT, Gingeras TR. Genome-wide transcription and the implications for genomic organization. Nat Rev Genet. 2007;8(6):413–423.
    3. Neymotin B, Athanasiadou R, Gresham D. Determination of in vivo RNA kinetics using RATE-seq. RNA. 2014;20(10):1645–1652.
    4. Danko CG, Hyland SL, Core LJ, Martins AL, Waters CT, Lee HW, Cheung VG, Kraus WL, Lis JT, Siepel A. Identification of active transcriptional regulatory elements from GRO-seq data. Nat Meth. 2015;12(5):433–438.
    5. Min I, Waterfall J, Core L, Munroe R, Schimenti J, Lis J. Regulating RNA polymerase pausing and transcription elongation in embryonic stem cells. Genes & Development. 2011;25(7):742–754. [Online]. Available: http://genesdev.cshlp.org/content/25/7/742.abstract.
