Computational approaches for isoform detection and estimation: good and bad news

doi:10.1186/1471-2105-15-135

BMC Bioinformatics. 2014 May 9;15:135.

Claudia Angelini¹, Daniela De Canditiis, Italia De Feis

PMID: 24885830
PMCID: PMC4098781
DOI: 10.1186/1471-2105-15-135

Claudia Angelini et al. BMC Bioinformatics. 2014.

Affiliation

¹ Istituto per le Applicazioni del Calcolo, CNR, Naples, Italy. c.angelini@iac.cnr.it.

Abstract

Background: The main goal of whole-transcriptome analysis is to correctly identify all transcripts expressed within a specific cell or tissue (at a particular stage and condition), to determine their structures, and to measure their abundances. RNA-seq data promise to allow identification and quantification of the transcriptome at an unprecedented level of resolution and accuracy, and at low cost. Several computational methods have been proposed for these purposes. However, it is still not clear which promises have already been met and which challenges remain open and require further methodological development.

Results: We carried out a simulation study to assess the performance of five widely used tools: CEM, Cufflinks, iReckon, RSEM, and SLIDE. All of them were run with default parameters. We considered three annotation scenarios: complete annotation, incomplete annotation, and no annotation at all. Moreover, we compared the methods in three modes of action. In the first mode, the methods were forced to deal only with isoforms present in the annotation; in the second mode, they were allowed to detect novel isoforms using the annotation as a guide; in the third mode, they operated in a fully data-driven way (although with the support of the alignment to the reference genome). In this last mode, precision and recall are quite poor. By contrast, results are better with the support of an annotation, even an incomplete one. Finally, the abundance estimation error often shows a very skewed distribution. Performance depends strongly on the true abundance of the isoforms: lowly (and sometimes also moderately) expressed isoforms are poorly detected and estimated. In particular, lowly expressed isoforms are identified mainly when they are provided in the original annotation as potential isoforms.

Conclusions: Both detection and quantification of all isoforms from RNA-seq data remain hard problems, affected by many factors. Overall, performance varies considerably with the mode of action and with the type of annotation available. Methods run with complete or partial annotation detect most of the expressed isoforms, although the number of false positives is often high. Fully data-driven approaches require more attention, at least for complex eukaryotic genomes. Improvements are desirable especially for isoform quantification and for the detection of low-abundance isoforms.
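The precision and recall figures discussed in the abstract can be illustrated with a minimal sketch. It assumes that a predicted isoform counts as correct when its identifier matches a truly expressed isoform; the abstract does not state the matching criterion the study actually used, and the function name and transcript identifiers below are hypothetical.

```python
from typing import Iterable, Tuple


def detection_precision_recall(
    predicted: Iterable[str], truth: Iterable[str]
) -> Tuple[float, float]:
    """Precision and recall of predicted isoforms against the truly expressed ones.

    Assumption: isoforms are compared by identifier only.
    """
    predicted_set, truth_set = set(predicted), set(truth)
    true_positives = len(predicted_set & truth_set)
    # Precision: fraction of reported isoforms that are truly expressed.
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    # Recall: fraction of truly expressed isoforms that were reported.
    recall = true_positives / len(truth_set) if truth_set else 0.0
    return precision, recall


# Toy data: five expressed isoforms, four reported, one of them a false positive.
expressed = {"tx1", "tx2", "tx3", "tx4", "tx5"}
reported = {"tx1", "tx2", "tx3", "txX"}
precision, recall = detection_precision_recall(reported, expressed)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A high number of false positives lowers precision without affecting recall, which is the pattern the conclusions describe for annotation-guided runs.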

Figures

**Figure 1**
**Pipeline of the simulation.** Simulation workflow used in both Set-up 1 and Set-up 2. The complete annotation (CA) was given to Flux Simulator to generate strand-specific PE reads (R). Reads were aligned to the reference genome using TopHat2. TopHat2 was run independently in three ways (with CA, with the incomplete annotation (IA), and without annotation), and each run produced a BAM alignment file. The BAM files were used as input for the compared methods. For each alignment file, CEM and Cufflinks were used in Modes 1, 2 and 3; SLIDE was used in Modes 1 and 2; iReckon was used only in Mode 2. When an annotation was provided during the alignment, the same annotation was also used for Modes 1 and 2 (see green boxes for CA and pink boxes for IA). When the data-driven alignment was carried out, we further distinguished (in Modes 1 and 2) between CA and IA as the annotation (see purple boxes). Since RSEM does not work with aligned reads, the output of Flux Simulator was processed using CA or IA as annotation (depicted in a green and a pink box, respectively).
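The combinations described in this caption can be enumerated with a short sketch. It is one reading of the caption, not the authors' pipeline code: it assumes that Mode 3 never uses an annotation, and it records the two RSEM runs without a mode or alignment because the caption gives neither. All names are hypothetical.

```python
from itertools import product

# Modes each aligner-based method was run in, as listed in the caption.
METHOD_MODES = {
    "CEM": (1, 2, 3),
    "Cufflinks": (1, 2, 3),
    "SLIDE": (1, 2),
    "iReckon": (2,),
}
# Annotation given to the aligner; None stands for the data-driven alignment.
ALIGNMENTS = ("CA", "IA", None)


def enumerate_runs():
    """Return (method, mode, alignment_annotation, method_annotation) tuples."""
    runs = []
    for alignment, (method, modes) in product(ALIGNMENTS, METHOD_MODES.items()):
        for mode in modes:
            if mode == 3:
                # Assumption: the fully data-driven mode takes no annotation.
                runs.append((method, mode, alignment, None))
            elif alignment is not None:
                # Annotation-driven alignment: the same annotation is reused.
                runs.append((method, mode, alignment, alignment))
            else:
                # Data-driven alignment: both CA and IA are tried in Modes 1 and 2.
                for annotation in ("CA", "IA"):
                    runs.append((method, mode, None, annotation))
    # RSEM starts from the simulated reads rather than from an alignment file.
    for annotation in ("CA", "IA"):
        runs.append(("RSEM", None, None, annotation))
    return runs


runs = enumerate_runs()
print(len(runs), "method configurations per simulated read set")
```

Under these assumptions each simulated read set yields 36 configurations: 9 for each annotation-driven alignment, 16 for the data-driven alignment, and 2 for RSEM.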

**Figure 2**
**Precision and Recall bar-plot in Set-up 1 for 100 bp-PE.** Panels A (upper left) and B (upper right) depict precision and recall bar-plots for the compared methods when the alignment is annotation driven. Panels C (bottom left) and D (bottom right) are analogous to Panels A and B, but for data-driven alignment. The figure refers to Set-up 1 and 100 bp-PE. Within each panel, the left column refers to methods used in Mode 1, the middle column to Mode 2, and the right column to Mode 3; the upper row shows results when the annotation is CA, and the bottom row the analogous case for IA. Different bars of the same colour for the same method and mode of action correspond to different depths (from left to right: 0.25M, 0.5M, 1M, 5M, 10M and 20M). When the alignment is annotation driven, the same annotation provided during the alignment was used for Modes 1 and 2.

**Figure 3**
**Precision and Recall bar-plot in Set-up 1 for 75 bp-PE.** Analogous to Figure 2 but for Set-up 1 and 75 bp-PE.

**Figure 4**
**Precision and Recall bar-plot in Set-up 1 for 50 bp-PE.** Analogous to Figure 2 but for Set-up 1 and 50 bp-PE.

**Figure 5**
**Precision and Recall bar-plot in Set-up 1 for 100 bp-SE.** Analogous to Figure 2 but for Set-up 1 and 100 bp-SE.

**Figure 6**
**Recall bar-plot versus isoform abundance in Set-up 1 for 100 bp-PE.** Panels A (upper left) and B (upper right) depict recall bar-plots versus isoform abundance for the compared methods when the alignment is annotation driven, for the CA and IA cases respectively. Panels C (bottom left) and D (bottom right) are analogous to Panels A and B, but for data-driven alignment. The figure refers to Set-up 1 and 100 bp-PE. Within each panel, the upper row shows the recall observed for lowly expressed isoforms, the middle row for moderately expressed isoforms, and the bottom row for highly expressed isoforms; the left column refers to methods used in Mode 1, the middle column to Mode 2, and the right column to Mode 3. Different bars of the same colour for each method and mode of action correspond to different depths (from left to right: 0.25M, 0.5M, 1M, 5M, 10M and 20M). When the alignment is annotation driven, the same annotation provided during the alignment was used for Modes 1 and 2.
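Recall split by expression class, as plotted in this figure, could be computed along the following lines. This is a sketch only: the cut-offs separating low, moderate and high expression are not given in the caption, so `LOW_MAX` and `HIGH_MIN` are hypothetical placeholders, as are the function names and toy data.

```python
from collections import defaultdict

# Hypothetical class boundaries; the source does not state the real ones.
LOW_MAX = 1.0
HIGH_MIN = 10.0


def abundance_class(abundance):
    """Assign a true abundance to the low, moderate or high expression class."""
    if abundance < LOW_MAX:
        return "low"
    if abundance < HIGH_MIN:
        return "moderate"
    return "high"


def recall_by_class(true_abundance, detected):
    """Recall per expression class, given true abundances and detected isoform ids."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for isoform, abundance in true_abundance.items():
        if abundance <= 0:
            continue  # not expressed, so it cannot contribute to recall
        cls = abundance_class(abundance)
        totals[cls] += 1
        if isoform in detected:
            hits[cls] += 1
    return {cls: hits[cls] / totals[cls] for cls in totals}


# Toy data: one of the two lowly expressed isoforms is missed.
truth = {"a": 0.2, "b": 0.5, "c": 3.0, "d": 7.0, "e": 50.0}
found = {"b", "c", "d", "e"}
per_class = recall_by_class(truth, found)
print(per_class)
```

Stratifying this way makes it visible when a good overall recall is carried by the highly expressed isoforms alone.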

**Figure 7**
**Recall bar-plot versus isoform abundance in Set-up 1 for 75 bp-PE.** Analogous to Figure 6 but for Set-up 1 and 75 bp-PE.

**Figure 8**
**Recall bar-plot versus isoform abundance in Set-up 1 for 50 bp-PE.** Analogous to Figure 6 but for Set-up 1 and 50 bp-PE.

**Figure 9**
**Recall bar-plot versus isoform abundance in Set-up 1 for 100 bp-SE.** Analogous to Figure 6 but for Set-up 1 and 100 bp-SE.

**Figure 10**
**True positive isoforms in Set-up 1, annotation driven alignment and Mode 2 with IA.** The panels depict the number of TP isoforms detected by methods in Mode 2 when IA is provided. They are divided into those that were already present in IA (bar in purple) and those not present in IA but retrieved by the methods. The latter are further divided into low, medium and high expression classes according to their true expression level. Panel A (upper left) refers to the case of 20M 100 bp-PE, Panel B (upper right) to 0.25M 100 bp-PE. Panels C (bottom left) and D (bottom right) are analogous to Panels A and B but for SE reads. All results are obtained with annotation driven alignment.

**Figure 11**
**True positive isoforms in Set-up 1, data driven alignment and Mode 2 with IA.** Analogous to Figure 10 but for Set-up 1 and data driven alignment.

**Figure 12**
**Precision, Recall and F-measure with thresholds (Set-up 1).** Precision, Recall and F-measure for Cufflinks (left panel) and CEM (right panel). Within each set of bars, the first one (depicted in purple) reports the result for the corresponding method in Mode 1 (without any threshold), as depicted in Figure 2. The last one (depicted in yellow) refers to the same method in Mode 2, as depicted in Figure 2. The two central bars (depicted in magenta and cyan, respectively) refer to the method in Mode 1 when estimated isoforms with expression levels below 10⁻⁵ and 10⁻¹, respectively, are set to zero. The figure refers to Set-up 1, to the case of 20M 100 bp-PE and annotation driven alignment with CA.
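The thresholding described in this caption can be sketched as follows: isoforms whose estimated expression falls below a cut-off are treated as not detected, and precision, recall and F-measure are then recomputed. The sketch assumes the F-measure is the usual harmonic mean of precision and recall, which the caption does not spell out; the function name and toy values are hypothetical.

```python
def detection_scores(estimates, truth, threshold=0.0):
    """Precision, recall and F-measure after zeroing estimates below `threshold`.

    estimates: mapping isoform id -> estimated expression level
    truth:     collection of truly expressed isoform ids
    """
    truth_set = set(truth)
    # An isoform is called detected if its estimate is positive and not below the cut-off.
    detected = {iso for iso, level in estimates.items() if level > 0 and level >= threshold}
    true_positives = len(detected & truth_set)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(truth_set) if truth_set else 0.0
    # Assumption: F-measure taken as the harmonic mean of precision and recall.
    total = precision + recall
    f_measure = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f_measure


# Toy Mode 1 output: every annotated isoform receives an estimate, some of them tiny.
estimates = {"t1": 5.0, "t2": 0.05, "t3": 1e-7, "t4": 2.0}
truth = {"t1", "t2", "t5"}
for cutoff in (0.0, 1e-5, 1e-1):
    p, r, f = detection_scores(estimates, truth, cutoff)
    print(f"threshold={cutoff:g}: precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

In this toy case the smaller cut-off removes a false positive and raises precision, while the larger one also discards a truly expressed low-abundance isoform and lowers recall.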

Cited by

Comparative assessment of methods for the computational inference of transcript isoform abundance from RNA-seq data.
Kanitz A, Gypas F, Gruber AJ, Gruber AR, Martin G, Zavolan M. Genome Biol. 2015 Jul 23;16(1):150. doi: 10.1186/s13059-015-0702-5. PMID: 26201343.
Union Exon Based Approach for RNA-Seq Gene Quantification: To Be or Not to Be?
Zhao S, Xi L, Zhang B. PLoS One. 2015 Nov 11;10(11):e0141910. doi: 10.1371/journal.pone.0141910. PMID: 26559532.
Using Synthetic Mouse Spike-In Transcripts to Evaluate RNA-Seq Analysis Tools.
Leshkowitz D, Feldmesser E, Friedlander G, Jona G, Ainbinder E, Parmet Y, Horn-Saban S. PLoS One. 2016 Apr 21;11(4):e0153782. doi: 10.1371/journal.pone.0153782. PMID: 27100792.
Pervasive isoform-specific translational regulation via alternative transcription start sites in mammals.
Wang X, Hou J, Quedenau C, Chen W. Mol Syst Biol. 2016 Jul 18;12(7):875. doi: 10.15252/msb.20166941. PMID: 27430939.
Understanding gene regulatory mechanisms by integrating ChIP-seq and RNA-seq data: statistical solutions to biological problems.
Angelini C, Costa V. Front Cell Dev Biol. 2014 Sep 17;2:51. doi: 10.3389/fcell.2014.00051. PMID: 25364758.
