Bioinformatics. 2015 Dec 15;31(24):3938-45.

doi: 10.1093/bioinformatics/btv488. Epub 2015 Sep 3.

Benchmark analysis of algorithms for determining and quantifying full-length mRNA splice forms from RNA-seq data

Katharina E Hayer¹, Angel Pizarro², Nicholas F Lahens³, John B Hogenesch³, Gregory R Grant⁴

Affiliations

¹ University of Pennsylvania, Institute for Translational Medicine and Therapeutics, Philadelphia, PA 19104.
² Scientific Computing at Amazon Web Services, Seattle, WA 98108.
³ Department of Pharmacology and.
⁴ University of Pennsylvania, Institute for Translational Medicine and Therapeutics, Philadelphia, PA 19104, Department of Genetics, University of Pennsylvania, Philadelphia, PA 19104, USA.

PMID: 26338770
PMCID: PMC4673975
DOI: 10.1093/bioinformatics/btv488

Katharina E Hayer et al. Bioinformatics. 2015.


Abstract

Motivation: Because of its advantages over microarrays, RNA sequencing (RNA-Seq) is gaining widespread popularity for highly parallel gene expression analysis. For example, RNA-Seq is expected to be able to provide accurate identification and quantification of full-length splice forms. A number of informatics packages have been developed for this purpose, but short reads make it a difficult problem in principle. Sequencing error and polymorphisms add further complications. It has become necessary to perform studies to determine which algorithms perform best and which, if any, perform adequately. However, there is a dearth of independent and unbiased benchmarking studies. Here we use both simulated and experimental benchmark data to evaluate the accuracy of these algorithms.

Results: We conclude that most methods are inaccurate even using idealized data, and that no method is highly accurate once multiple splice forms, polymorphisms, intron signal, sequencing errors, alignment errors, annotation errors and other complicating factors are present. These results point to the pressing need for further algorithm development.

Availability and implementation: Simulated datasets and other supporting information can be found at http://bioinf.itmat.upenn.edu/BEERS/bp2.

Supplementary information: Supplementary data are available at Bioinformatics online.


Figures

**Fig. 1.**
*Left*: Shows the number of mouse mm9 ENSEMBL transcripts as a function of the number of exons. 90% of transcripts have multiple exons. 65% have >5 and 35% have >10. *Right*: Distribution of the minimum number of splice forms necessary to explain the RNA-Seq junctions in 300 M read pairs of mouse liver (Zhang *et al.*, 2014). This is based on the first 200 RefSeq genes annotated on chromosome 1.

**Fig. 2.**
These plots depict mouse liver RNA-Seq data (Zhang *et al.*, 2014). Each plot has three tracks: transcript models (bottom), depth-of-coverage (middle, red = forward, blue = reverse) and spliced reads (top, blue = annotated, green = novel, numbers give how many reads spliced cleanly across each junction). (A) Shows data for a gene with one annotated splice form. In this case the one annotated splice form is sufficient to completely explain the data. (B) A region showing several annotated genes. Here there are many reads spliced between different genes. In addition to unannotated splice junctions, there is also evidence for completely unannotated genes in this region.

**Fig. 3.**
This shows the depth of coverage of a full-length cDNA clone, which has been transcribed and subjected to the Ribo-Zero (red) and PolyA selection (orange) protocols for removal of ribosomal RNA. Both protocols result in extreme local bias (Lahens *et al.*, 2014). PolyA causes 3′ bias (note this gene is oriented on the reverse strand).

**Fig. 4.**
Accuracy results for simulated dataset T1 for the methods which utilize a reference genome. This represents the most ideal case where all genes are highly expressed, there are no polymorphisms and there are no alignment errors. Splicing is divided into three types; in the first two types, precision was above 90% only when there is a single splice form. The analysis was run with gene annotation provided.
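
As a rough illustration of the precision figure quoted in this caption, the sketch below scores a set of predicted splice forms against a known truth set, as is possible with simulated data. It assumes, hypothetically, that each splice form is represented as a tuple of intron coordinates and that a prediction counts as correct only on an exact match; the paper's own matching criteria may differ. Recall is included only as the usual companion metric and is not taken from the caption.

```python
def precision_recall(true_forms, predicted_forms):
    """Score predicted splice forms against the known (simulated) truth.

    Assumption: each splice form is a hashable description of its intron
    chain, e.g. a tuple of (start, end) pairs, and only exact matches count.
    """
    true_set = set(true_forms)
    pred_set = set(predicted_forms)
    true_positives = len(true_set & pred_set)

    # Precision: fraction of predicted forms that are real.
    precision = true_positives / len(pred_set) if pred_set else 0.0
    # Recall: fraction of real forms that were predicted.
    recall = true_positives / len(true_set) if true_set else 0.0
    return precision, recall


# Hypothetical gene with two true splice forms and three predictions.
form_a = ((100, 200), (300, 400))
form_b = ((100, 200), (350, 400))
form_c = ((100, 250), (300, 400))
form_d = ((120, 200),)

precision, recall = precision_recall([form_a, form_b], [form_a, form_c, form_d])
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy example only one of three predictions is correct, so precision is one third even though half of the true forms were recovered.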

**Fig. 5.**
Accuracy results for simulated datasets EP and ER. This represents the most ideal case where all genes are highly expressed, there are no polymorphisms and there are no alignment errors. Results are given separately for low, medium and high depth of coverage. Analyses were run with gene models provided.

**Fig. 6.**
Accuracy results for IVT data. Analyses were run with gene models provided. Two ribosomal depletion protocols are represented. The rightmost panel shows the results on simulated data, for comparison.

**Fig. 7.**
**(A)** Correlation of true FPKM with the inferred FPKM on dataset ER. Only transcripts where both the true and inferred values are positive were included. Extreme outliers were also removed. The set sizes for each correlation are given in (C). **(B)** Bars on the left show the number of transcripts where the true expression is zero but the algorithm assigned it positive expression; bars on the right show the number of transcripts where the true expression is positive but the algorithm assigned it zero expression. **(C)** This shows the number of transcripts where the true expression is positive and the algorithm gave it positive expression. The horizontal line indicates the total number of truly expressed transcripts.
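
The three panel quantities described in this caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names and the dictionary inputs are hypothetical, Pearson correlation is assumed because the caption does not name the correlation measure, and the removal of extreme outliers mentioned in the caption is not implemented.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns None if it is undefined."""
    n = len(xs)
    if n < 2:
        return None
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


def summarize_quantification(true_fpkm, inferred_fpkm):
    """Compare true and inferred FPKM values keyed by transcript ID.

    Transcripts missing from either dictionary are treated as FPKM 0
    (an assumption made for this sketch).
    """
    both_positive = []
    false_positives = 0  # true expression zero, inferred positive (panel B, left)
    false_negatives = 0  # true expression positive, inferred zero (panel B, right)
    for transcript_id in set(true_fpkm) | set(inferred_fpkm):
        truth = true_fpkm.get(transcript_id, 0.0)
        inferred = inferred_fpkm.get(transcript_id, 0.0)
        if truth > 0 and inferred > 0:
            both_positive.append((truth, inferred))
        elif truth == 0 and inferred > 0:
            false_positives += 1
        elif truth > 0 and inferred == 0:
            false_negatives += 1
    return {
        # Panel A: correlation restricted to transcripts positive in both.
        "correlation": pearson([t for t, _ in both_positive],
                               [i for _, i in both_positive]),
        "false_positives": false_positives,
        "false_negatives": false_negatives,
        # Panel C: size of the set used for the correlation.
        "n_both_positive": len(both_positive),
    }


# Hypothetical values for five transcripts.
true_values = {"t1": 10.0, "t2": 20.0, "t3": 0.0, "t4": 5.0, "t5": 30.0}
inferred_values = {"t1": 12.0, "t2": 24.0, "t3": 3.0, "t4": 0.0, "t5": 36.0}
summary = summarize_quantification(true_values, inferred_values)
print(summary)
```

Restricting the correlation to transcripts that are positive in both sets means a method can score well in panel (A) while still missing or inventing many transcripts, which is why the counts in panels (B) and (C) are reported alongside it.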


References

1. Anders S., et al. (2015) HTSeq—A Python framework to work with high-throughput sequencing data. Bioinformatics 31, 166–169.
1. Behr J., et al. (2013) MITIE: Simultaneous RNA-Seq-based transcript identification and quantification in multiple samples. Bioinformatics 29, 2529–2538.
1. Bernard E., et al. (2014) Efficient RNA isoform identification and quantification from RNA-Seq data with network flows. Bioinformatics 30, 2447–2455.
1. Chandramohan R., et al. (2013) Benchmarking RNA-Seq quantification tools. Conf. Proc. IEEE. Eng. Med. Biol. Soc. 2013, 647–50.
1. Dobin A., et al. (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21.


Grants and funding

U54HL117798/HL/NHLBI NIH HHS/United States
