BMC Bioinformatics. 2014 May 9:15:135.
doi: 10.1186/1471-2105-15-135.

Computational approaches for isoform detection and estimation: good and bad news

Claudia Angelini et al. BMC Bioinformatics.

Abstract

Background: The main goal of whole-transcriptome analysis is to correctly identify all transcripts expressed in a specific cell or tissue, at a particular stage and condition, to determine their structures, and to measure their abundances. RNA-seq data promise to allow identification and quantification of the transcriptome at an unprecedented level of resolution and accuracy, and at low cost. Several computational methods have been proposed for these purposes. However, it is still not clear which promises have already been met and which challenges remain open and require further methodological development.

Results: We carried out a simulation study to assess the performance of five widely used tools: CEM, Cufflinks, iReckon, RSEM, and SLIDE. All were run with default parameters. We considered three annotation scenarios: complete annotation, incomplete annotation, and no annotation at all. We also compared the methods in three modes of action. In the first mode, the methods were restricted to the isoforms present in the annotation; in the second, they were allowed to detect novel isoforms using the annotation as a guide; in the third, they operated in a fully data-driven way (although with the support of the alignment to the reference genome). In the third mode, precision and recall are quite poor. By contrast, results are better with the support of an annotation, even an incomplete one. Finally, the abundance estimation error often shows a very skewed distribution. Performance depends strongly on the true abundance of the isoforms: lowly (and sometimes also moderately) expressed isoforms are poorly detected and estimated. In particular, lowly expressed isoforms are identified mainly when they are provided in the original annotation as potential isoforms.

Conclusions: Detection and quantification of all isoforms from RNA-seq data are both still hard problems, and they are affected by many factors. Overall, performance varies substantially with the mode of action and the type of annotation available. With complete or partial annotation, the methods detect most of the expressed isoforms, although the number of false positives is often high. Fully data-driven approaches require more attention, at least for complex eukaryotic genomes. Improvements are desirable especially for isoform quantification and for the detection of low-abundance isoforms.
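
The precision and recall values reported in the abstract can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each isoform is represented by an identifier and that a predicted isoform counts as a true positive when its identifier matches a simulated one; the abstract does not state the matching criterion used in the study. The function name `detection_metrics` is hypothetical, and the F-measure is computed here as the usual harmonic mean of precision and recall.

```python
def detection_metrics(true_isoforms, predicted_isoforms):
    """Return (precision, recall, f_measure) for a set of predicted isoforms.

    Assumption: isoforms are compared by identifier only. A real evaluation
    would need a structural criterion (e.g. matching exon boundaries).
    """
    truth = set(true_isoforms)
    predicted = set(predicted_isoforms)

    true_positives = len(truth & predicted)

    # Precision: fraction of predicted isoforms that are really expressed.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of expressed isoforms that were predicted.
    recall = true_positives / len(truth) if truth else 0.0

    # F-measure: harmonic mean of precision and recall (0 if both are 0).
    if precision + recall == 0:
        f_measure = 0.0
    else:
        f_measure = 2 * precision * recall / (precision + recall)

    return precision, recall, f_measure
```

For example, with four simulated isoforms and three predictions of which two are correct, precision is 2/3 and recall is 1/2, which shows how a method can miss expressed isoforms and report false positives at the same time.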

Figures

Figure 1
Pipeline of the simulation. Simulation workflow used in both Set-up 1 and Set-up 2. The complete annotation (CA) was given to Flux Simulator to generate strand-specific PE reads (R). Reads were aligned to the reference genome using TopHat2, which was run independently in three ways (with CA, with the incomplete annotation (IA), and without annotation). Each run produced an alignment BAM file, and these files were used as input for the compared methods. For each alignment file, CEM and Cufflinks were used in Modes 1, 2 and 3; SLIDE was used in Modes 1 and 2; iReckon was used only in Mode 2. When an annotation was provided during the alignment, the same annotation was also used for Modes 1 and 2 (see green boxes for CA and pink boxes for IA). When the data-driven alignment was carried out, we further distinguished (in Modes 1 and 2) the cases with CA and IA as annotation (see purple boxes). Since RSEM does not work with aligned reads, the output of Flux Simulator was processed using CA or IA as annotation (depicted in a green and a pink box, respectively).
Figure 2
Precision and Recall bar-plot in Set-up 1 for 100 bp-PE. Panels A (upper left) and B (upper right) depict precision and recall bar-plots for the compared methods when the alignment is annotation driven. Panels C (bottom left) and D (bottom right) are analogous to Panels A and B, when the alignment is data driven. The figure refers to Set-up 1 and 100 bp-PE. Within each panel, the left column refers to methods used in Mode 1, the middle column to methods used in Mode 2, and the right column to methods used in Mode 3; the upper row shows results when the annotation is CA, and the bottom row the analogous case when the annotation is IA. Different bars of the same colour for the same method and mode of action correspond to different sequencing depths (from left to right: 0.25M, 0.5M, 1M, 5M, 10M and 20M). When the alignment is annotation driven, the same annotation provided during the alignment was used for Modes 1 and 2.
Figure 3
Precision and Recall bar-plot in Set-up 1 for 75 bp-PE. Analogous to Figure 2 but for Set-up 1 and 75 bp-PE.
Figure 4
Precision and Recall bar-plot in Set-up 1 for 50 bp-PE. Analogous to Figure 2 but for Set-up 1 and 50 bp-PE.
Figure 5
Precision and Recall bar-plot in Set-up 1 for 100 bp-SE. Analogous to Figure 2 but for Set-up 1 and 100 bp-SE.
Figure 6
Recall bar-plot versus isoform abundance in Set-up 1 for 100 bp-PE. Panels A (upper left) and B (upper right) depict recall bar-plots versus isoform abundance for the compared methods when the alignment is annotation driven, for the CA and IA cases respectively. Panels C (bottom left) and D (bottom right) are analogous to Panels A and B, when the alignment is data driven. The figure refers to Set-up 1 and 100 bp-PE. Within each panel, the upper row shows the recall observed for lowly expressed isoforms, the middle row for moderately expressed isoforms and the bottom row for highly expressed isoforms; the left column refers to methods used in Mode 1, the middle column to methods used in Mode 2, and the right column to methods used in Mode 3. Different bars of the same colour for each method and mode of action correspond to different sequencing depths (from left to right: 0.25M, 0.5M, 1M, 5M, 10M and 20M). When the alignment is annotation driven, the same annotation provided during the alignment was used for Modes 1 and 2.
Figure 7
Recall bar-plot versus isoform abundance in Set-up 1 for 75 bp-PE. Analogous to Figure 6 but for Set-up 1 and 75 bp-PE.
Figure 8
Recall bar-plot versus isoform abundance in Set-up 1 for 50 bp-PE. Analogous to Figure 6 but for Set-up 1 and 50 bp-PE.
Figure 9
Recall bar-plot versus isoform abundance in Set-up 1 for 100 bp-SE. Analogous to Figure 6 but for Set-up 1 and 100 bp-SE.
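
The recall-versus-abundance breakdown shown in Figures 6 to 9 can be sketched in Python as follows. This is only an illustration under stated assumptions: the class boundaries `low_cut` and `high_cut` are hypothetical placeholders (the captions do not give the cut-offs that separate lowly, moderately and highly expressed isoforms), isoforms are matched by identifier, and the function names are invented for this example.

```python
from collections import defaultdict


def abundance_class(abundance, low_cut=1.0, high_cut=10.0):
    """Assign an isoform to 'low', 'medium' or 'high' by its true abundance.

    Assumption: low_cut and high_cut are illustrative values, not the
    boundaries used in the study.
    """
    if abundance < low_cut:
        return "low"
    if abundance < high_cut:
        return "medium"
    return "high"


def recall_by_class(true_abundance, detected, low_cut=1.0, high_cut=10.0):
    """Compute recall separately for each abundance class.

    true_abundance: dict mapping isoform identifier -> true abundance
    detected: set of isoform identifiers reported by a method
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for isoform, abundance in true_abundance.items():
        cls = abundance_class(abundance, low_cut, high_cut)
        totals[cls] += 1
        if isoform in detected:
            hits[cls] += 1
    # Classes with no true isoforms are simply absent from the result.
    return {cls: hits[cls] / totals[cls] for cls in totals}
```

Stratifying recall this way makes the dependence on true abundance visible, which a single pooled recall value would hide.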
Figure 10
True positive isoforms in Set-up 1, annotation driven alignment and Mode 2 with IA. The panels depict the number of true positive (TP) isoforms detected by the methods in Mode 2 when IA is provided. They are divided into those that were already present in IA (bar in purple) and those not present in IA but retrieved by the methods. The latter are further divided into low, medium and high expression classes according to their true expression level. Panel A (upper left) refers to the case of 20M 100 bp-PE, and Panel B (upper right) to 0.25M 100 bp-PE. Panels C (bottom left) and D (bottom right) are analogous to Panels A and B but for SE reads. All results are obtained with annotation driven alignment.
Figure 11
True positive isoforms in Set-up 1, data driven alignment and Mode 2 with IA. Analogous to Figure 10 but for Set-up 1 and data driven alignment.
Figure 12
Precision, Recall and F-measure with thresholds (Set-up 1). Precision, Recall and F-measure for Cufflinks (left panel) and CEM (right panel). Within each set of bars, the first one (depicted in purple) reports the result for the corresponding method in Mode 1 (without any threshold), as depicted in Figure 2. The last one (depicted in yellow) refers to the same method in Mode 2, as depicted in Figure 2. The two central bars (depicted in magenta and cyan, respectively) refer to the method in Mode 1 when estimated isoforms with expression levels below 10^-5 and 10^-1, respectively, are set to zero. The figure refers to Set-up 1, to the case of 20M 100 bp-PE and annotation driven alignment with CA.
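
The thresholding described in the Figure 12 caption can be sketched in Python as below. The sketch assumes that an isoform counts as detected when its estimated expression is greater than zero after thresholding, and that isoforms are matched by identifier; neither assumption is stated in the caption. The function names and the toy values in the usage note are hypothetical.

```python
def apply_threshold(estimates, threshold):
    """Set estimated expression levels below `threshold` to zero."""
    return {
        isoform: (value if value >= threshold else 0.0)
        for isoform, value in estimates.items()
    }


def called_isoforms(estimates):
    """Assumption: an isoform is 'detected' if its estimate is above zero."""
    return {isoform for isoform, value in estimates.items() if value > 0}


def precision_recall(truth, called):
    """Precision and recall of the called set against the true set."""
    true_positives = len(truth & called)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


def metrics_at_threshold(estimates, truth, threshold):
    """Threshold the estimates, then score the surviving isoforms."""
    thresholded = apply_threshold(estimates, threshold)
    return precision_recall(set(truth), called_isoforms(thresholded))
```

With toy estimates such as `{"t1": 5.0, "t2": 0.5, "t3": 1e-6, "t4": 0.05}` and true isoforms `{"t1", "t2"}`, raising the threshold from 10^-5 to 10^-1 removes more spurious low-level calls and so increases precision, while recall stays the same only as long as no true isoform falls below the threshold.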
