Bioinformatics. 2011 Oct 1;27(19):2633-40.
doi: 10.1093/bioinformatics/btr458. Epub 2011 Aug 8.

FDM: a graph-based statistical method to detect differential transcription using RNA-seq data


Darshan Singh et al. Bioinformatics. 2011.

Abstract

Motivation: In eukaryotic cells, alternative splicing expands the diversity of RNA transcripts and plays an important role in tissue-specific differentiation, and can be misregulated in disease. To understand these processes, there is a great need for methods to detect differential transcription between samples. Our focus is on samples observed using short-read RNA sequencing (RNA-seq).

Methods: We characterize differential transcription between two samples as the difference in the relative abundance of the transcript isoforms present in the samples. The magnitude of differential transcription of a gene between two samples can be measured by the square root of the Jensen-Shannon divergence (JSD*) between the gene's transcript abundance vectors in each sample. We define a weighted splice-graph representation of RNA-seq data, summarizing in compact form the alignment of RNA-seq reads to a reference genome. The flow difference metric (FDM) identifies regions of differential RNA transcript expression between pairs of splice graphs, without the need for an underlying gene model or catalog of transcripts. We present a novel non-parametric statistical test between splice graphs to assess the significance of differential transcription, and extend it to group-wise comparison incorporating sample replicates.
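As a rough illustration of the JSD* measure described above, the following Python sketch computes the square root of the Jensen-Shannon divergence between two transcript abundance vectors. The function names are hypothetical, and the use of base-2 logarithms is an assumption (the abstract does not state the base); with base 2, JSD* ranges from 0 for identical isoform mixtures to 1 for mixtures with no isoform in common.

```python
import math


def shannon_entropy(p):
    """Shannon entropy, in bits, of a probability vector (zero entries are skipped)."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def jsd_star(abund_a, abund_b):
    """Square root of the Jensen-Shannon divergence between two abundance vectors.

    Assumption: base-2 logarithms. Inputs are non-negative abundances for the
    same ordered set of transcripts; they are normalized to sum to 1 here.
    """
    if len(abund_a) != len(abund_b):
        raise ValueError("abundance vectors must have the same length")
    total_a, total_b = sum(abund_a), sum(abund_b)
    if total_a <= 0 or total_b <= 0:
        raise ValueError("each abundance vector must have a positive sum")
    p = [x / total_a for x in abund_a]
    q = [x / total_b for x in abund_b]
    m = [(a + b) / 2 for a, b in zip(p, q)]  # mixture distribution
    jsd = shannon_entropy(m) - (shannon_entropy(p) + shannon_entropy(q)) / 2
    return math.sqrt(max(jsd, 0.0))  # guard against tiny negative rounding error
```

Because the inputs are normalized inside the function, raw abundance estimates can be passed directly; only the relative proportions of the isoforms affect the result.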

Results: Using simulated RNA-seq data consisting of four technical replicates of two samples with varying transcription between genes, we show that (i) the FDM is highly correlated with JSD* (r = 0.82) when average RNA-seq coverage of the transcripts is sufficiently deep; and (ii) the FDM is able to identify 90% of genes with differential transcription when JSD* > 0.28 and coverage > 7. This represents higher sensitivity than Cufflinks (without annotations) and rDiff (MMD), which respectively identified 69% and 49% of the genes in this region as differentially transcribed. Using annotations identifying the transcripts, Cufflinks was able to identify 86% of the genes in this region as differentially transcribed. Using experimental data consisting of four replicates each for two cancer cell lines (MCF7 and SUM102), FDM identified 1425 genes as significantly different in transcription. Subsequent study of the samples using quantitative real-time polymerase chain reaction (qRT-PCR) of several differential transcription sites identified by FDM confirmed significant differences at these sites.

Availability: http://csbio-linux001.cs.unc.edu/nextgen/software/FDM

Contact: darshan@email.unc.edu

Supplementary information: Supplementary data are available at Bioinformatics online.


Figures

Fig. 1.
ACT-Graph: the nodes are genome coordinates. A solid (blue) edge represents an exon or part of an exon labeled with the average depth of read coverage along the interval. A dashed (green) edge is a splice edge and is labeled by the number of reads that include the splice. Alternative splicing features such as mutually exclusive exons, a retained intron and a skipped exon are illustrated. Nodes drawn as boxes, circles and hexagons, respectively, represent annotated-only positions, novel-only splice positions and both annotated and novel positions.
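The graph described in this caption can be sketched as a small data structure. The Python below is a minimal sketch, not the authors' implementation: the class and method names are hypothetical, the coordinates and weights in the example are invented, and "divergence node" follows the usage in the Fig. 3 caption (a node with outdegree or indegree greater than 1).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    start: int     # genome coordinate of the source node
    end: int       # genome coordinate of the target node
    kind: str      # "exon" or "splice"
    weight: float  # exon: average read depth; splice: number of spliced reads


class SpliceGraph:
    """Hypothetical weighted splice graph whose nodes are genome coordinates."""

    def __init__(self):
        self.out_edges = defaultdict(list)
        self.in_edges = defaultdict(list)

    def add_edge(self, start, end, kind, weight):
        if kind not in ("exon", "splice"):
            raise ValueError("kind must be 'exon' or 'splice'")
        edge = Edge(start, end, kind, weight)
        self.out_edges[start].append(edge)
        self.in_edges[end].append(edge)
        return edge

    def divergence_nodes(self):
        """Coordinates where flow splits (outdegree > 1) or merges (indegree > 1)."""
        nodes = set(self.out_edges) | set(self.in_edges)
        return sorted(
            n for n in nodes
            if len(self.out_edges.get(n, [])) > 1 or len(self.in_edges.get(n, [])) > 1
        )


# Invented example: three exons, with the middle exon skipped by some reads.
g = SpliceGraph()
g.add_edge(100, 200, "exon", 12.0)   # exon 1
g.add_edge(200, 300, "splice", 8)    # exon 1 -> exon 2
g.add_edge(300, 400, "exon", 8.0)    # exon 2 (skipped in some transcripts)
g.add_edge(400, 500, "splice", 8)    # exon 2 -> exon 3
g.add_edge(200, 500, "splice", 4)    # exon 1 -> exon 3, skipping exon 2
g.add_edge(500, 600, "exon", 12.0)   # exon 3
print(g.divergence_nodes())
```

In this toy graph the exon-skipping event shows up as a split at the end of exon 1 and a merge at the start of exon 3, which is the kind of local structure a graph-based comparison can examine without a transcript catalog.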
Fig. 2.
ACT-Graph Compression (Section 2.2.2): plot of file sizes of ACT-Graph (ACTG), FastQ file (FASTQ) and the alignment file (SAM). As the number of reads increases, the storage used by ACT-Graph increases orders of magnitude more slowly than other representations.
Fig. 3.
FDM and JSD illustration: an example for a gene in two samples A and B is shown. The gene has two transcripts with expression ratios of 1:4 and 4:1 in the two samples, respectively. The FDM is computed using the two ACT-Graphs. The ACT-Graphs have two divergence nodes: node n2 has outdegree 2 and node n5 has indegree 2. FDM(A,B) = [formula image] = [formula image]. The JSD is computed using ground-truth knowledge of the transcript abundance vectors, V_A = [0.2, 0.8] and V_B = [0.8, 0.2], giving JSD(V_A, V_B) = 0.28. Thus, JSD* = 0.53 is the magnitude of differential transcription representing ground truth.
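The JSD values quoted in the Fig. 3 caption can be checked by hand. The derivation below assumes base-2 logarithms, a choice the caption does not state but one that reproduces the reported 0.28 and 0.53; H denotes Shannon entropy and M the average of the two abundance vectors.

```latex
\begin{aligned}
M &= \tfrac{1}{2}\,(V_A + V_B) = [0.5,\ 0.5], \qquad H(M) = 1 \\
H(V_A) = H(V_B) &= -0.2\log_2 0.2 \;-\; 0.8\log_2 0.8 \;\approx\; 0.722 \\
\mathrm{JSD}(V_A, V_B) &= H(M) - \tfrac{1}{2}\bigl(H(V_A) + H(V_B)\bigr) \;\approx\; 1 - 0.722 \;=\; 0.278 \;\approx\; 0.28 \\
\mathrm{JSD}^{*} &= \sqrt{\mathrm{JSD}(V_A, V_B)} \;\approx\; \sqrt{0.278} \;\approx\; 0.53
\end{aligned}
```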
Fig. 4.
Sensitivity and specificity of the FDM as a function of read coverage (Sections 3.1.1 and 3.1.2): Synthetic data of three sample pairs of 1500 genes each are analyzed. The first sample pairs have low gene coverage (coverage = [0,5]), the second sample pairs have medium gene coverage (coverage = [10,15]), and the third sample pairs have high gene coverage (coverage of 20 or higher). (A) JSD*-FDM correlation: The points in the scatter plots correspond to (JSD*, FDM) values for a gene, where JSD* is ground truth and FDM is computed from ACT-Graphs. When the average gene coverage is high, the correlation between JSD* and FDM is high. For average coverage higher than 20, the correlation is 0.819. (B) FDM as a classifier for JSD*: a gene is marked positive for differential transcription if JSD* is more than 0.22 and negative otherwise. FDM is used to classify genes as positive or negative. Thus, for each FDM threshold, we get some true positives and some false positives. By varying the threshold, the complete curve is plotted. The FDM values of (0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64) are marked on the curve. With coverage of 20 or higher, 90% of true positives can be identified with ~10% false positives.
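The classifier evaluation in panel (B) amounts to a threshold sweep. The Python sketch below shows one way such a curve could be traced, assuming a gene is called positive when its FDM exceeds the threshold; the function name and the six toy genes are invented for illustration and are not data from the paper.

```python
def roc_points(jsd_star_values, fdm_values, thresholds, jsd_cutoff=0.22):
    """Return (threshold, false positive rate, true positive rate) per FDM threshold.

    Ground truth: a gene is positive when JSD* > jsd_cutoff (0.22 in the caption).
    Assumption: a gene is called differential when its FDM exceeds the threshold.
    """
    labels = [j > jsd_cutoff for j in jsd_star_values]
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative gene")
    points = []
    for t in thresholds:
        calls = [f > t for f in fdm_values]
        tp = sum(1 for c, l in zip(calls, labels) if c and l)
        fp = sum(1 for c, l in zip(calls, labels) if c and not l)
        points.append((t, fp / n_neg, tp / n_pos))
    return points


# Invented toy values for six genes (not from the paper).
toy_jsd_star = [0.05, 0.10, 0.30, 0.50, 0.15, 0.40]
toy_fdm = [0.01, 0.05, 0.20, 0.40, 0.10, 0.03]
points = roc_points(toy_jsd_star, toy_fdm, thresholds=[0.02, 0.08, 0.32])
for threshold, fpr, tpr in points:
    print(f"FDM > {threshold}: FPR = {fpr:.2f}, TPR = {tpr:.2f}")
```

Raising the threshold trades sensitivity for specificity, so each marked FDM value on the published curve corresponds to one (FPR, TPR) pair of this kind.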
Fig. 5.
Detection of differential transcription by different methods. The circles in scatterplots (a-d) represent 2100 genes in two samples with varying differential transcription (measured by JSD*) and varying depth of RNA-seq sampling (measured by the average coverage per transcribed nucleotide). Filled circles correspond to genes with significant differential transcription according to each of the methods. (a) FDM consistently identifies differential transcription when coverage is high or JSD* is high. For example, for genes with JSD* > 0.28 and log10(coverage) > 0.85 (coverage > 7), FDM was able to identify 90% of the genes as differentially transcribed. Two other methods not using annotations, (c) Cuffdiff without annotations (Trapnell et al., 2010) and (d) rDiff (MMD; Stegle et al., 2010), had lower sensitivity, identifying differential transcription in 68% and 49% of the genes in this region, respectively. (b) For comparison, we also ran Cuffdiff with gene annotations, which identified differential transcription in 86% of the genes in this region.
Fig. 6.
UCSC browser: Gene CD46 in MCF7 and SUM102 (Section 3.2). The first four samples are from MCF7 and the next four samples are from SUM102. This gene was identified as a gene with differential expression using the FDM methodology. Note that the middle exon is skipped in different ratios in MCF7 and SUM102. This result was verified by qRT-PCR. Additional figures are provided in Supplementary Material.


