Bioinformatics. 2011 Oct 1;27(19):2633-40.

doi: 10.1093/bioinformatics/btr458. Epub 2011 Aug 8.

FDM: a graph-based statistical method to detect differential transcription using RNA-seq data

Darshan Singh¹, Christian F Orellana, Yin Hu, Corbin D Jones, Yufeng Liu, Derek Y Chiang, Jinze Liu, Jan F Prins

Darshan Singh et al. Bioinformatics. 2011.

Affiliation

¹ Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA. darshan@email.unc.edu

PMID: 21824971
PMCID: PMC3179659
DOI: 10.1093/bioinformatics/btr458

Abstract

Motivation: In eukaryotic cells, alternative splicing expands the diversity of RNA transcripts, plays an important role in tissue-specific differentiation and can be misregulated in disease. To understand these processes, there is a great need for methods to detect differential transcription between samples. Our focus is on samples observed using short-read RNA sequencing (RNA-seq).

Methods: We characterize differential transcription between two samples as the difference in the relative abundance of the transcript isoforms present in the samples. The magnitude of differential transcription of a gene between two samples can be measured by the square root of the Jensen Shannon Divergence (JSD*) between the gene's transcript abundance vectors in each sample. We define a weighted splice-graph representation of RNA-seq data, summarizing in compact form the alignment of RNA-seq reads to a reference genome. The flow difference metric (FDM) identifies regions of differential RNA transcript expression between pairs of splice graphs, without need for an underlying gene model or catalog of transcripts. We present a novel non-parametric statistical test between splice graphs to assess the significance of differential transcription, and extend it to group-wise comparison incorporating sample replicates.

Results: Using simulated RNA-seq data consisting of four technical replicates of two samples with varying transcription between genes, we show that (i) the FDM is highly correlated with JSD* (r=0.82) when average RNA-seq coverage of the transcripts is sufficiently deep; and (ii) the FDM is able to identify 90% of genes with differential transcription when JSD* >0.28 and coverage >7. This represents higher sensitivity than Cufflinks (without annotations) and rDiff (MMD), which respectively identified 69 and 49% of the genes in this region as differentially transcribed. Using annotations identifying the transcripts, Cufflinks was able to identify 86% of the genes in this region as differentially transcribed. Using experimental data consisting of four replicates each for two cancer cell lines (MCF7 and SUM102), FDM identified 1425 genes as significantly different in transcription. Subsequent study of the samples using quantitative real-time polymerase chain reaction (qRT-PCR) at several differential transcription sites identified by FDM confirmed significant differences at these sites.

Availability: http://csbio-linux001.cs.unc.edu/nextgen/software/FDM

Contact: darshan@email.unc.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

**Fig. 1.**
ACT-Graph: the nodes are genome coordinates. A solid (blue) edge represents an exon or part of an exon labeled with the average depth of read coverage along the interval. A dashed (green) edge is a splice edge and is labeled by the number of reads that include the splice. Alternative splicing features such as mutually exclusive exons, a retained intron and a skipped exon are illustrated. Nodes drawn as boxes, circles and hexagons, respectively, represent annotated-only positions, novel-only splice positions and both annotated and novel positions.
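
The structure described in this caption can be sketched as a small data type. This is only an illustrative sketch, not the authors' implementation: the class names `Edge` and `ActGraph`, the method names and the example coordinates are all hypothetical, and treating any node with outdegree or indegree above 1 as a divergence node is an assumption based on the wording of the Fig. 3 caption.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Edge:
    start: int     # genome coordinate of the source node
    end: int       # genome coordinate of the target node
    kind: str      # "exon" or "splice"
    weight: float  # exon: average read depth; splice: number of spanning reads


@dataclass
class ActGraph:
    """Hypothetical container for a weighted splice graph."""
    edges: list = field(default_factory=list)

    def add_exon(self, start: int, end: int, avg_depth: float) -> None:
        self.edges.append(Edge(start, end, "exon", float(avg_depth)))

    def add_splice(self, donor: int, acceptor: int, reads: int) -> None:
        self.edges.append(Edge(donor, acceptor, "splice", float(reads)))

    def divergence_nodes(self) -> list:
        # Assumption: a divergence node has outdegree > 1 or indegree > 1.
        out_deg = Counter(e.start for e in self.edges)
        in_deg = Counter(e.end for e in self.edges)
        nodes = set(out_deg) | set(in_deg)
        return sorted(n for n in nodes if out_deg[n] > 1 or in_deg[n] > 1)


# Made-up skipped-exon gene: three exons, the middle one optional.
g = ActGraph()
g.add_exon(100, 200, avg_depth=10.0)
g.add_exon(300, 400, avg_depth=8.0)
g.add_exon(500, 600, avg_depth=10.0)
g.add_splice(200, 300, reads=8)  # path that includes the middle exon
g.add_splice(400, 500, reads=8)
g.add_splice(200, 500, reads=2)  # path that skips the middle exon
print(g.divergence_nodes())      # [200, 500]
```

In this toy graph the donor site at 200 has two outgoing splice edges and the acceptor site at 500 has two incoming ones, which is the pattern a skipped exon produces.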

**Fig. 2.**
ACT-Graph Compression (Section 2.2.2): plot of file sizes of ACT-Graph (ACTG), FastQ file (FASTQ) and the alignment file (SAM). As the number of reads increases, the storage used by ACT-Graph increases orders of magnitude more slowly than other representations.

**Fig. 3.**
FDM and JSD illustration: an example for a gene in two samples A and B is shown. The gene has two transcripts with expression ratios of 1:4 and 4:1 in the two samples, respectively. The FDM is computed using the two ACT-Graphs. The ACT-Graphs have two divergence nodes: node n2 has outdegree 2 and node n5 has indegree 2. FDM(A,B) = [formula image not recovered]. The JSD is computed using the ground truth knowledge of the transcript abundance vectors, V_A = [0.2,0.8] and V_B = [0.8,0.2]: JSD(V_A, V_B) = 0.28. Thus, JSD* = √0.28 = 0.53 is the magnitude of differential transcription representing ground truth.
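
The JSD figures quoted in this caption can be checked with a few lines of Python. This is a minimal sketch assuming the standard Jensen-Shannon divergence with base-2 logarithms, an assumption that happens to reproduce the values in the caption. The function names `jsd` and `jsd_star` are illustrative and not taken from the FDM software.

```python
import math


def jsd(p, q):
    """Jensen-Shannon divergence (base-2 logs) between two abundance vectors."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with x_i == 0 contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def jsd_star(p, q):
    """Square root of the JSD, written JSD* in the abstract."""
    return math.sqrt(jsd(p, q))


v_a = [0.2, 0.8]
v_b = [0.8, 0.2]
print(round(jsd(v_a, v_b), 2))       # 0.28
print(round(jsd_star(v_a, v_b), 2))  # 0.53
```

With base-2 logarithms the JSD is bounded by 1, so JSD* also lies in [0, 1], where 0 means identical relative isoform abundances.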

**Fig. 4.**
Sensitivity and specificity of the FDM as a function of read coverage (Sections 3.1.1 and 3.1.2): Synthetic data for three sample pairs of 1500 genes each are analyzed. The first sample pair has low gene coverage (coverage = [0,5]), the second has medium gene coverage (coverage = [10,15]) and the third has high gene coverage (coverage of 20 or higher). **(A)** JSD* - FDM Correlation: The points in the scatter plots correspond to (JSD*, FDM) values for a gene, where JSD* is ground truth and FDM is computed from ACT-Graphs. When the average gene coverage is high, the correlation between JSD* and FDM is high. For average coverage higher than 20, the correlation is 0.819. **(B)** FDM as a classifier for JSD*: a gene is marked positive for differential transcription if JSD* is more than 0.22 and negative otherwise. FDM is used to classify genes as positive or negative. Thus, each FDM threshold yields some true positives and some false positives. By varying the threshold, the complete curve is plotted. The FDM values of (0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64) are marked on the curve. With coverage of 20 or higher, 90% of true positives can be identified with ~10% false positives.
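
The classifier construction in panel (B) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' evaluation code: the function name `roc_points` is hypothetical, the toy JSD* and FDM values are made up purely to exercise the code, and calling a gene positive when FDM is strictly greater than the threshold is an assumption.

```python
def roc_points(jsd_star, fdm, jsd_cutoff=0.22,
               thresholds=(0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64)):
    """Return (threshold, false-positive rate, true-positive rate) triples."""
    # Ground truth: a gene is positive if its JSD* exceeds the cutoff.
    labels = [j > jsd_cutoff for j in jsd_star]
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for t in thresholds:
        # Assumption: a gene is called differential when FDM > t.
        called = [f > t for f in fdm]
        tp = sum(1 for c, l in zip(called, labels) if c and l)
        fp = sum(1 for c, l in zip(called, labels) if c and not l)
        tpr = tp / n_pos if n_pos else 0.0
        fpr = fp / n_neg if n_neg else 0.0
        points.append((t, fpr, tpr))
    return points


# Made-up values for six genes (not data from the paper).
toy_jsd_star = [0.05, 0.10, 0.30, 0.50, 0.15, 0.40]
toy_fdm = [0.02, 0.05, 0.20, 0.45, 0.18, 0.03]

for t, fpr, tpr in roc_points(toy_jsd_star, toy_fdm):
    print(f"FDM > {t:.2f}: FPR = {fpr:.2f}, TPR = {tpr:.2f}")
```

Each threshold contributes one point to the curve. Lowering the threshold moves the point toward the upper right, trading more false positives for more true positives.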

**Fig. 5.**
Detection of differential transcription by different methods. The circles in scatterplots (a–d) represent 2100 genes in two samples with varying differential transcription (measured by JSD*) and varying depth of RNA-seq sampling (measured by the average coverage per transcribed nucleotide). Filled circles correspond to genes with significant differential transcription according to each of the methods. (a) FDM consistently identifies differential transcription when coverage is high or JSD* is high. For example, for genes with JSD* >0.28 and log(coverage) >0.85 (coverage >7), FDM was able to identify 90% of the genes as differentially transcribed. Two other methods not using annotations, (c) Cuffdiff without annotations (Trapnell *et al.*, 2010) and (d) rDiff (MMD) (Stegle *et al.*, 2010), had lower sensitivity, identifying differential transcription in 68 and 49% of the genes in this region, respectively. (b) For comparison, we also ran Cuffdiff with gene annotations, which identified differential transcription in 86% of the genes in this region.

**Fig. 6.**
UCSC browser: Gene CD46 in MCF7 and SUM102 (Section 3.2). The first four samples are from MCF7 and the next four samples are from SUM102. This gene was identified as a gene with differential expression using the FDM methodology. Note that the middle exon is skipped in different ratios in MCF7 and SUM102. This result was verified by qRT-PCR. Additional figures are provided in Supplementary Material.

Cited by

A survey of best practices for RNA-seq data analysis.
Conesa A, Madrigal P, Tarazona S, Gomez-Cabrero D, Cervera A, McPherson A, Szcześniak MW, Gaffney DJ, Elo LL, Zhang X, Mortazavi A. Genome Biol. 2016 Jan 26;17:13. doi: 10.1186/s13059-016-0881-8. PMID: 26813401. Review.
DiffSplice: the genome-wide detection of differential splicing events with RNA-seq.
Hu Y, Huang Y, Du Y, Orellana CF, Singh D, Johnson AR, Monroy A, Kuan PF, Hammond SM, Makowski L, Randell SH, Chiang DY, Hayes DN, Jones C, Liu Y, Prins JF, Liu J. Nucleic Acids Res. 2013 Jan;41(2):e39. doi: 10.1093/nar/gks1026. Epub 2012 Nov 15. PMID: 23155066.
Efficient experimental design and analysis strategies for the detection of differential expression using RNA-Sequencing.
Robles JA, Qureshi SE, Stephen SJ, Wilson SR, Burden CJ, Taylor JM. BMC Genomics. 2012 Sep 17;13:484. doi: 10.1186/1471-2164-13-484. PMID: 22985019.
Functional genomic analysis and neuroanatomical localization of miR-2954, a song-responsive sex-linked microRNA in the zebra finch.
Lin YC, Balakrishnan CN, Clayton DF. Front Neurosci. 2014 Dec 16;8:409. doi: 10.3389/fnins.2014.00409. eCollection 2014. PMID: 25565940.
Leveraging transcript quantification for fast computation of alternative splicing profiles.
Alamancos GP, Pagès A, Trincado JL, Bellora N, Eyras E. RNA. 2015 Sep;21(9):1521-31. doi: 10.1261/rna.051557.115. Epub 2015 Jul 15. PMID: 26179515.
