Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2014 Dec 23;111(51):E5593-601.
doi: 10.1073/pnas.1419161111. Epub 2014 Dec 5.

rMATS: robust and flexible detection of differential alternative splicing from replicate RNA-Seq data

Affiliations

rMATS: robust and flexible detection of differential alternative splicing from replicate RNA-Seq data

Shihao Shen et al. Proc Natl Acad Sci U S A. .

Abstract

Ultra-deep RNA sequencing (RNA-Seq) has become a powerful approach for genome-wide analysis of pre-mRNA alternative splicing. We previously developed multivariate analysis of transcript splicing (MATS), a statistical method for detecting differential alternative splicing between two RNA-Seq samples. Here we describe a new statistical model and computer program, replicate MATS (rMATS), designed for detection of differential alternative splicing from replicate RNA-Seq data. rMATS uses a hierarchical model to simultaneously account for sampling uncertainty in individual replicates and variability among replicates. In addition to the analysis of unpaired replicates, rMATS also includes a model specifically designed for paired replicates between sample groups. The hypothesis-testing framework of rMATS is flexible and can assess the statistical significance over any user-defined magnitude of splicing change. The performance of rMATS is evaluated by the analysis of simulated and real RNA-Seq data. rMATS outperformed two existing methods for replicate RNA-Seq data in all simulation settings, and RT-PCR yielded a high validation rate (94%) in an RNA-Seq dataset of prostate cancer cell lines. Our data also provide guiding principles for designing RNA-Seq studies of alternative splicing. We demonstrate that it is essential to incorporate biological replicates in the study design. Of note, pooling RNAs or merging RNA-Seq data from multiple replicates is not an effective approach to account for variability, and the result is particularly sensitive to outliers. The rMATS source code is freely available at rnaseq-mats.sourceforge.net/. As the popularity of RNA-Seq continues to grow, we expect rMATS will be useful for studies of alternative splicing in diverse RNA-Seq projects.

Keywords: RNA sequencing; alternative splicing; exon; isoform; transcriptome.

PubMed Disclaimer

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Fig. 1.
Fig. 1.
The schematic diagram of an exon skipping event. The exon inclusion reads (I) are the reads from the upstream splice junction, the alternative exon itself, and the downstream splice junction. The exon skipping reads (S) are the reads from the skipping splice junction that directly connects the upstream exon to the downstream exon.
Fig. 2.
Fig. 2.
The statistical framework of the unpaired rMATS model. For exon i and the kth replicate, the total RNA-Seq read counts for the exon inclusion and skipping isoforms are denoted as ni1k,ni2k for sample groups 1 and 2, respectively. The read counts for the exon inclusion isoform are denoted as Ii1k,Ii2k. The exon inclusion levels are denoted as ψi1k,ψi2k. The proportion of the read count from the exon inclusion isoform is adjusted by a normalization function fi that considers the lengths of the exon inclusion and skipping isoforms. rMATS uses a binomial distribution to model the read count from the exon inclusion isoform given the exon inclusion level in each individual replicate and a logit-normal distribution to model the variation among replicates within sample group. The mean and variance of exon inclusion levels in the two sample groups are denoted as ψi1,ψi2 and σi12,σi22. A likelihood-ratio test is used to calculate the P value that the difference between ψi1 and ψi2 exceeds a given user-defined threshold c.
Fig. 3.
Fig. 3.
Simulation studies to assess the performance of rMATS and the importance of replicates. We simulated 5,000 exons, where 5% of the exons were differentially spliced and the rest were not differentially spliced. We simulated five replicates in each sample group. The exon inclusion levels in individual replicates were simulated from a normal distribution with different SDs in three different studies: (A) SD = 0.01, (B) SD = 0.02, and (C) SD = 0.05. To assess the effect of outliers, we also simulated an additional dataset where 1 of the 10 replicates (of the two sample groups) had a large SD of 0.2. In all scenarios, the analysis by rMATS on replicate data always outperformed the analysis of pooled data (without the information of replicates), as indicated by the receiver operating characteristic (ROC) curves. In addition, the analyses on replicate data were more robust against outliers because rMATS modeled the variation within sample groups, whereas the ROC curves of the pooled data (without the information of replicates) were heavily influenced by outliers.
Fig. 4.
Fig. 4.
An example of rMATS unpaired analysis of prostate cancer cell lines. (A) The RNA-Seq read counts and estimated exon inclusion levels of ARHGAP17 exon 15 in a pair of epithelial (PC3E) and mesenchymal (GS689) cell lines, each with three biological replicates. (B) The log likelihood of observing the data given all possible combinations of ψi1,ψi2. In this likelihood-ratio test, the null hypothesis is |ψi1ψi2|5% and the alternative hypothesis is |ψi1ψi2|>5%. The combination of ψi1,ψi2 that maximizes the likelihood of observing the data under the constraint of the null hypothesis or without such a constraint is indicated.
Fig. 5.
Fig. 5.
Simulation studies to assess the influence of sample size and sequencing depth on detection accuracy. We simulated five different SDs (SD = 0.01, 0.02, 0.05, 0.10, and 0.20) within the sample group. The true positive rate at 5% false positive rate was calculated and plotted for each set of simulated data. (A) A total of 200 million paired-end reads were simulated for each sample group and distributed among 3–10 replicates. (B) A total of 1.6 billion paired-end reads were simulated for each sample group and distributed among 3–10 replicates.
Fig. 6.
Fig. 6.
The statistical framework of the paired rMATS model. Each replicate in sample group 1 is paired with another replicate in sample group 2. For exon i and the kth replicate, the total RNA-Seq read counts for the exon inclusion and skipping isoforms are denoted as ni1k,ni2k for sample groups 1 and 2, respectively. The read counts for the exon inclusion isoform are denoted as Ii1k,Ii2k. The exon inclusion levels are denoted as ψi1k,ψi2k. The proportion of the read count from the exon inclusion isoform is adjusted by a normalization function fi that considers the lengths of the exon inclusion and skipping isoforms. rMATS uses a bivariate normal distribution to model the variation among replicates within sample group and the correlation between paired replicates. The mean and variance of exon inclusion levels in the two sample groups are denoted as ψi1,ψi2 and σi12,σi22. The correlation parameter is denoted as ρi.
Fig. 7.
Fig. 7.
A comparison between paired and unpaired rMATS models. (A) We applied paired and unpaired rMATS to 65 tumor-normal matched pairs in TCGA RNA-Seq data of clear cell renal cell carcinoma (ccRCC). For 315 exons with FDR ≤ 1% by either model, we compared the P values from the paired and unpaired models, relative to the estimated correlation parameter ρ between pairs. Paired rMATS produced smaller (more significant) P values, especially for exons with a high degree of correlation among matched pairs. The 11 exons unique to paired rMATS are marked with red circles. (B) The mean and the SEM of Δψ (between tumor-normal matched pairs) for the 11 exons unique to paired rMATS and 11 randomly selected exons common to both paired and unpaired rMATS.
Fig. 8.
Fig. 8.
(A–C) Simulation studies to compare the performance of rMATS, Cufflinks, and DiffSplice. The performances of these methods on a simulated dataset are indicated by their respective receiver operating characteristic (ROC) curves. Insets (Bottom Right) highlight the most critical area of the ROC curve where the false positive rate is low (<0.2). (A) Five replicates were simulated for each sample group. The exon inclusion levels in individual replicates were simulated from a normal distribution with SD = 0.05. The read counts were sampled from TCGA ccRCC RNA-Seq data. (B) One of the replicates was randomly set to have only 10% of the typical read coverage. (C) One of the replicates was randomly set as an outlier with large SD of exon inclusion levels (SD = 0.2).

Similar articles

Cited by

References

    1. Keren H, Lev-Maor G, Ast G. Alternative splicing and evolution: Diversification, exon definition and function. Nat Rev Genet. 2010;11(5):345–355. - PubMed
    1. Nilsen TW, Graveley BR. Expansion of the eukaryotic proteome by alternative splicing. Nature. 2010;463(7280):457–463. - PMC - PubMed
    1. Wang ET, et al. Alternative isoform regulation in human tissue transcriptomes. Nature. 2008;456(7221):470–476. - PMC - PubMed
    1. Pan Q, Shai O, Lee LJ, Frey BJ, Blencowe BJ. Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat Genet. 2008;40(12):1413–1415. - PubMed
    1. Wang GS, Cooper TA. Splicing in disease: Disruption of the splicing code and the decoding machinery. Nat Rev Genet. 2007;8(10):749–761. - PubMed

Publication types

Associated data