Bioinformatics. 2015 Dec 15;31(24):3938-45.
doi: 10.1093/bioinformatics/btv488. Epub 2015 Sep 3.

Benchmark analysis of algorithms for determining and quantifying full-length mRNA splice forms from RNA-seq data

Katharina E Hayer et al. Bioinformatics.

Abstract

Motivation: Because of the advantages of RNA sequencing (RNA-Seq) over microarrays, it is gaining widespread popularity for highly parallel gene expression analysis. For example, RNA-Seq is expected to be able to provide accurate identification and quantification of full-length splice forms. A number of informatics packages have been developed for this purpose, but short reads make it a difficult problem in principle. Sequencing errors and polymorphisms add further complications. It has become necessary to perform studies to determine which algorithms perform best and which, if any, perform adequately. However, there is a dearth of independent and unbiased benchmarking studies. Here we take an approach using both simulated and experimental benchmark data to evaluate the accuracy of these algorithms.

Results: We conclude that most methods are inaccurate even using idealized data, and that no method is highly accurate once multiple splice forms, polymorphisms, intron signal, sequencing errors, alignment errors, annotation errors and other complicating factors are present. These results point to the pressing need for further algorithm development.

Availability and implementation: Simulated datasets and other supporting information can be found at http://bioinf.itmat.upenn.edu/BEERS/bp2.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
Left: Number of mouse mm9 ENSEMBL transcripts as a function of the number of exons. 90% of transcripts have multiple exons, 65% have >5 and 35% have >10. Right: Distribution of the minimum number of splice forms necessary to explain the RNA-Seq junctions in 300 M read pairs of mouse liver (Zhang et al., 2014). This is based on the first 200 RefSeq genes annotated on Chromosome 1.
Fig. 2.
These plots depict mouse liver RNA-Seq data (Zhang et al., 2014). Each plot has three tracks: transcript models (bottom), depth-of-coverage (middle, red = forward, blue = reverse) and spliced reads (top, blue = annotated, green = novel, numbers give how many reads spliced cleanly across each junction). (A) Shows data for a gene with one annotated splice form. In this case the one annotated splice form is sufficient to completely explain the data. (B) A region showing several annotated genes. Here there are many reads spliced between different genes. In addition to unannotated splice junctions, there is also evidence for completely unannotated genes in this region
Fig. 3.
This shows the depth of coverage of a full-length cDNA clone, which has been transcribed and subjected to the Ribo-Zero (red) and PolyA selection (orange) protocols for removal of ribosomal RNA. Both protocols result in extreme local bias (Lahens et al., 2014). PolyA causes 3′ bias (note this gene is oriented on the reverse strand)
Fig. 4.
Accuracy results for simulated dataset T1 for the methods which utilize a reference genome. This represents the most ideal case where all genes are highly expressed, there are no polymorphisms and there are no alignment errors. Splicing is divided into three types; in the first two types, the only cases where precision was above 90% are those with a single splice form. The analysis was run with gene annotation provided.
Fig. 5.
Accuracy results for simulated datasets EP and ER. This represents the most ideal case where all genes are highly expressed, there are no polymorphisms and there are no alignment errors. Results are given separately for low, medium and high depth of coverage. Analyses were run with gene models provided
Fig. 6.
Accuracy results for IVT data. Analyses were run with gene models provided. Two ribosomal depletion protocols are represented. The rightmost panel shows the results on simulated data, for comparison
Fig. 7.
(A) Correlation of true FPKM with the inferred FPKM on dataset ER. Only transcripts where both the true and inferred values are positive were included. Extreme outliers were also removed. The set sizes for each correlation are given in (C). (B) Bars on the left show the number of transcripts where the true expression is zero but the algorithm assigned it positive expression; bars on the right show the number of transcripts where the true expression is positive but the algorithm assigned it zero expression. (C) The number of transcripts where the true expression is positive and the algorithm gave it positive expression. The horizontal line indicates the total number of truly expressed transcripts.
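
The three quantities described in the Fig. 7 caption can be illustrated with a minimal Python sketch. It is not the authors' code. The function names, the dictionary inputs keyed by transcript ID, and the choice of Pearson correlation are all assumptions made for illustration, since the caption does not state which correlation measure was used. The outlier removal mentioned in the caption is omitted because its criterion is not given here.

```python
from math import sqrt


def compare_expression(true_fpkm, inferred_fpkm):
    """Split transcripts into the categories named in the Fig. 7 caption.

    Both arguments are hypothetical dicts mapping transcript ID -> FPKM.
    A transcript missing from either dict is treated as having FPKM 0.
    """
    both_positive = []    # (true, inferred) pairs used for panels A and C
    false_positives = 0   # true == 0, inferred > 0 (panel B, left bars)
    false_negatives = 0   # true > 0, inferred == 0 (panel B, right bars)
    for tid in sorted(set(true_fpkm) | set(inferred_fpkm)):
        truth = true_fpkm.get(tid, 0.0)
        estimate = inferred_fpkm.get(tid, 0.0)
        if truth > 0 and estimate > 0:
            both_positive.append((truth, estimate))
        elif truth == 0 and estimate > 0:
            false_positives += 1
        elif truth > 0 and estimate == 0:
            false_negatives += 1
    return both_positive, false_positives, false_negatives


def pearson(pairs):
    """Pearson correlation of (x, y) pairs; None if it is undefined."""
    n = len(pairs)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)


# Toy values, invented purely to exercise the functions.
true_fpkm = {"t1": 10.0, "t2": 0.0, "t3": 5.0, "t4": 2.0, "t5": 0.0}
inferred_fpkm = {"t1": 20.0, "t2": 1.0, "t3": 10.0, "t4": 0.0}

pairs, n_fp, n_fn = compare_expression(true_fpkm, inferred_fpkm)
r = pearson(pairs)
```

Restricting the correlation to transcripts that are positive in both sets, as the caption describes, means the correlation alone says nothing about missed or spurious transcripts, which is why the two error counts are reported alongside it.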

