Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2014 Sep 1;30(17):2447-55.
doi: 10.1093/bioinformatics/btu317. Epub 2014 May 9.

Efficient RNA isoform identification and quantification from RNA-Seq data with network flows

Affiliations

Efficient RNA isoform identification and quantification from RNA-Seq data with network flows

Elsa Bernard et al. Bioinformatics. .

Abstract

Motivation: Several state-of-the-art methods for isoform identification and quantification are based on [Formula: see text]-regularized regression, such as the Lasso. However, explicitly listing the-possibly exponentially-large set of candidate transcripts is intractable for genes with many exons. For this reason, existing approaches using the [Formula: see text]-penalty are either restricted to genes with few exons or only run the regression algorithm on a small set of preselected isoforms.

Results: We introduce a new technique called FlipFlop, which can efficiently tackle the sparse estimation problem on the full set of candidate isoforms by using network flow optimization. Our technique removes the need of a preselection step, leading to better isoform identification while keeping a low computational cost. Experiments with synthetic and real RNA-Seq data confirm that our approach is more accurate than alternative methods and one of the fastest available.

Availability and implementation: Source code is freely available as an R package from the Bioconductor Web site (http://www.bioconductor.org/), and more information is available at http://cbio.ensmp.fr/flipflop.

Supplementary information: Supplementary data are available at Bioinformatics online.

PubMed Disclaimer

Figures

Fig. 1.
Fig. 1.
Illustration of the graph construction for a gene with 5 exons. The original splicing graph is represented in (a). The 5 exons are represented as vertices and an arrow between two vertices indicates a junction. The nodes of graph G in (b) and (c) are bins with positive effective length denoted by gray square, as well as source s and sink t represented as circles. G in (b) is the resulting graph when all exons are bigger than the read length. In that case, each bin either corresponds to a unique exon, or to a junction between two exons. G in (c) is the resulting graph when the length of exon 3 is smaller than the read length. Some bins involve then more than two exons, here bins (2-3-4) and (2-3-5). The source links all possible starting bins, and conversely all possible stopping bins are linked to the sink. There is a one-to-one correspondence between (s, t)-paths in G (paths starting at s and ending at t) and isoform candidates. For example, the path (s,1,1-4,4,4-5,5,t) corresponds to isoform 1-4-5
Fig. 2.
Fig. 2.
Flow interpretation of isoforms using the same graph as in Figure 1b. For the sake of clarity, some edges connecting s and t to internal nodes are not represented, and the length of the different bins is assumed to be equal. In (a), one unit of flow is carried along the path in red, corresponding to an isoform with abundance 1. In (b), another isoform with abundance 3 is added, yielding additional read counts at every node
Fig. 3.
Fig. 3.
Precision and recall of compared methods on simulated reads from the UCSC annotated human transcripts. (a) Single-end reads with different lengths (100, 200, 300 bp). (b) Paired-end reads with different lengths (100, 125, 150, 175 bp). (c) Single-end reads with a fixed 150 bp length and an increasing amout of material (1, 5, 10 million)
Fig. 4.
Fig. 4.
Average CPU times in milliseconds (logarithmic scale) for the compared methods with process a gene from human simulated 100 bp single-end reads
Fig. 5.
Fig. 5.
Precision and recall of compared methods on human embryonic stem cells data

References

    1. Ahuja RK, et al. Network Flows: Theory, Algorithms, and Applications. Englewood Cliffs, New Jersey: Prentice Hall; 1993.
    1. Behr J, et al. Mitie: simultaneous RNA-Seq based transcript identification and quantification in multiple samples. Bioinformatics. 2013;29:2529–2538. - PMC - PubMed
    1. Bertsekas DP. Network Optimization: Continuous and Discrete Models. Athena Scientific, Belmont, Massachusetts: Athena Scientific; 1998.
    1. Bohnert R, Rätsch G. rQuant.web: a tool for RNA-Seq-based transcript quantitation. Nucleic Acids Res. 2010;38:W348–W351. - PMC - PubMed
    1. Goldberg AV. An efficient implementation of a scaling minimum-cost flow algorithm. J. Algorithm. 1997;22:1–29.

Publication types