. 2014 Dec 23;111(51):E5593-601.

doi: 10.1073/pnas.1419161111. Epub 2014 Dec 5.

rMATS: robust and flexible detection of differential alternative splicing from replicate RNA-Seq data

Shihao Shen¹, Juw Won Park¹, Zhi-xiang Lu¹, Lan Lin¹, Michael D Henry², Ying Nian Wu³, Qing Zhou³, Yi Xing⁴

Affiliations

¹ Departments of Microbiology, Immunology, & Molecular Genetics and.
² Departments of Molecular Physiology and Biophysics and Pathology, University of Iowa, Iowa City, IA 52242.
³ Statistics, University of California, Los Angeles, CA 90095; and.
⁴ Departments of Microbiology, Immunology, & Molecular Genetics and yxing@ucla.edu.

PMID: 25480548
PMCID: PMC4280593
DOI: 10.1073/pnas.1419161111

rMATS: robust and flexible detection of differential alternative splicing from replicate RNA-Seq data

Shihao Shen et al. Proc Natl Acad Sci U S A. 2014.

. 2014 Dec 23;111(51):E5593-601.

doi: 10.1073/pnas.1419161111. Epub 2014 Dec 5.

Authors

Shihao Shen¹, Juw Won Park¹, Zhi-xiang Lu¹, Lan Lin¹, Michael D Henry², Ying Nian Wu³, Qing Zhou³, Yi Xing⁴

Affiliations

¹ Departments of Microbiology, Immunology, & Molecular Genetics and.
² Departments of Molecular Physiology and Biophysics and Pathology, University of Iowa, Iowa City, IA 52242.
³ Statistics, University of California, Los Angeles, CA 90095; and.
⁴ Departments of Microbiology, Immunology, & Molecular Genetics and yxing@ucla.edu.

PMID: 25480548
PMCID: PMC4280593
DOI: 10.1073/pnas.1419161111

Abstract

Ultra-deep RNA sequencing (RNA-Seq) has become a powerful approach for genome-wide analysis of pre-mRNA alternative splicing. We previously developed multivariate analysis of transcript splicing (MATS), a statistical method for detecting differential alternative splicing between two RNA-Seq samples. Here we describe a new statistical model and computer program, replicate MATS (rMATS), designed for detection of differential alternative splicing from replicate RNA-Seq data. rMATS uses a hierarchical model to simultaneously account for sampling uncertainty in individual replicates and variability among replicates. In addition to the analysis of unpaired replicates, rMATS also includes a model specifically designed for paired replicates between sample groups. The hypothesis-testing framework of rMATS is flexible and can assess the statistical significance over any user-defined magnitude of splicing change. The performance of rMATS is evaluated by the analysis of simulated and real RNA-Seq data. rMATS outperformed two existing methods for replicate RNA-Seq data in all simulation settings, and RT-PCR yielded a high validation rate (94%) in an RNA-Seq dataset of prostate cancer cell lines. Our data also provide guiding principles for designing RNA-Seq studies of alternative splicing. We demonstrate that it is essential to incorporate biological replicates in the study design. Of note, pooling RNAs or merging RNA-Seq data from multiple replicates is not an effective approach to account for variability, and the result is particularly sensitive to outliers. The rMATS source code is freely available at rnaseq-mats.sourceforge.net/. As the popularity of RNA-Seq continues to grow, we expect rMATS will be useful for studies of alternative splicing in diverse RNA-Seq projects.

Keywords: RNA sequencing; alternative splicing; exon; isoform; transcriptome.

PubMed Disclaimer

Conflict of interest statement

The authors declare no conflict of interest.

Figures

**Fig. 1.**
The schematic diagram of an exon skipping event. The exon inclusion reads $(I)$ are the reads from the upstream splice junction, the alternative exon itself, and the downstream splice junction. The exon skipping reads $(S)$ are the reads from the skipping splice junction that directly connects the upstream exon to the downstream exon.

**Fig. 2.**
The statistical framework of the unpaired rMATS model. For exon $i$ and the $k th$ replicate, the total RNA-Seq read counts for the exon inclusion and skipping isoforms are denoted as $n_{i 1 k}, n_{i 2 k}$ for sample groups 1 and 2, respectively. The read counts for the exon inclusion isoform are denoted as $I_{i 1 k}, I_{i 2 k}$ . The exon inclusion levels are denoted as $ψ_{i 1 k}, ψ_{i 2 k}$ . The proportion of the read count from the exon inclusion isoform is adjusted by a normalization function $f_{i}$ that considers the lengths of the exon inclusion and skipping isoforms. rMATS uses a binomial distribution to model the read count from the exon inclusion isoform given the exon inclusion level in each individual replicate and a logit-normal distribution to model the variation among replicates within sample group. The mean and variance of exon inclusion levels in the two sample groups are denoted as $ψ_{i 1}, ψ_{i 2}$ and $σ_{i 1}^{2}, σ_{i 2}^{2}$ . A likelihood-ratio test is used to calculate the P value that the difference between $ψ_{i 1}$ and $ψ_{i 2}$ exceeds a given user-defined threshold $c$ .

**Fig. 3.**
Simulation studies to assess the performance of rMATS and the importance of replicates. We simulated 5,000 exons, where 5% of the exons were differentially spliced and the rest were not differentially spliced. We simulated five replicates in each sample group. The exon inclusion levels in individual replicates were simulated from a normal distribution with different SDs in three different studies: (A) SD = 0.01, (B) SD = 0.02, and (C) SD = 0.05. To assess the effect of outliers, we also simulated an additional dataset where 1 of the 10 replicates (of the two sample groups) had a large SD of 0.2. In all scenarios, the analysis by rMATS on replicate data always outperformed the analysis of pooled data (without the information of replicates), as indicated by the receiver operating characteristic (ROC) curves. In addition, the analyses on replicate data were more robust against outliers because rMATS modeled the variation within sample groups, whereas the ROC curves of the pooled data (without the information of replicates) were heavily influenced by outliers.

**Fig. 4.**
An example of rMATS unpaired analysis of prostate cancer cell lines. (A) The RNA-Seq read counts and estimated exon inclusion levels of *ARHGAP17* exon 15 in a pair of epithelial (PC3E) and mesenchymal (GS689) cell lines, each with three biological replicates. (B) The log likelihood of observing the data given all possible combinations of $ψ_{i 1}, ψ_{i 2}$ . In this likelihood-ratio test, the null hypothesis is $| ψ_{i 1} - ψ_{i 2} | \leq 5 %$ and the alternative hypothesis is $| ψ_{i 1} - ψ_{i 2} | > 5 %$ . The combination of $ψ_{i 1}, ψ_{i 2}$ that maximizes the likelihood of observing the data under the constraint of the null hypothesis or without such a constraint is indicated.

**Fig. 5.**
Simulation studies to assess the influence of sample size and sequencing depth on detection accuracy. We simulated five different SDs (SD = 0.01, 0.02, 0.05, 0.10, and 0.20) within the sample group. The true positive rate at 5% false positive rate was calculated and plotted for each set of simulated data. (A) A total of 200 million paired-end reads were simulated for each sample group and distributed among 3–10 replicates. (B) A total of 1.6 billion paired-end reads were simulated for each sample group and distributed among 3–10 replicates.

**Fig. 6.**
The statistical framework of the paired rMATS model. Each replicate in sample group 1 is paired with another replicate in sample group 2. For exon $i$ and the $k th$ replicate, the total RNA-Seq read counts for the exon inclusion and skipping isoforms are denoted as $n_{i 1 k}, n_{i 2 k}$ for sample groups 1 and 2, respectively. The read counts for the exon inclusion isoform are denoted as $I_{i 1 k}, I_{i 2 k}$ . The exon inclusion levels are denoted as $ψ_{i 1 k}, ψ_{i 2 k}$ . The proportion of the read count from the exon inclusion isoform is adjusted by a normalization function $f_{i}$ that considers the lengths of the exon inclusion and skipping isoforms. rMATS uses a bivariate normal distribution to model the variation among replicates within sample group and the correlation between paired replicates. The mean and variance of exon inclusion levels in the two sample groups are denoted as $ψ_{i 1}, ψ_{i 2}$ and $σ_{i 1}^{2}, σ_{i 2}^{2}$ . The correlation parameter is denoted as $ρ_{i}$ .

**Fig. 7.**
A comparison between paired and unpaired rMATS models. (A) We applied paired and unpaired rMATS to 65 tumor-normal matched pairs in TCGA RNA-Seq data of clear cell renal cell carcinoma (ccRCC). For 315 exons with FDR ≤ 1% by either model, we compared the P values from the paired and unpaired models, relative to the estimated correlation parameter $ρ$ between pairs. Paired rMATS produced smaller (more significant) P values, especially for exons with a high degree of correlation among matched pairs. The 11 exons unique to paired rMATS are marked with red circles. (B) The mean and the SEM of Δψ (between tumor-normal matched pairs) for the 11 exons unique to paired rMATS and 11 randomly selected exons common to both paired and unpaired rMATS.

**Fig. 8.**
(*A–C*) Simulation studies to compare the performance of rMATS, Cufflinks, and DiffSplice. The performances of these methods on a simulated dataset are indicated by their respective receiver operating characteristic (ROC) curves. *Insets* (*Bottom Right*) highlight the most critical area of the ROC curve where the false positive rate is low (<0.2). (A) Five replicates were simulated for each sample group. The exon inclusion levels in individual replicates were simulated from a normal distribution with SD = 0.05. The read counts were sampled from TCGA ccRCC RNA-Seq data. (B) One of the replicates was randomly set to have only 10% of the typical read coverage. (C) One of the replicates was randomly set as an outlier with large SD of exon inclusion levels (SD = 0.2).

See this image and copyright information in PMC

References

1. Keren H, Lev-Maor G, Ast G. Alternative splicing and evolution: Diversification, exon definition and function. Nat Rev Genet. 2010;11(5):345–355. - PubMed
1. Nilsen TW, Graveley BR. Expansion of the eukaryotic proteome by alternative splicing. Nature. 2010;463(7280):457–463. - PMC - PubMed
1. Wang ET, et al. Alternative isoform regulation in human tissue transcriptomes. Nature. 2008;456(7221):470–476. - PMC - PubMed
1. Pan Q, Shai O, Lee LJ, Frey BJ, Blencowe BJ. Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat Genet. 2008;40(12):1413–1415. - PubMed
1. Wang GS, Cooper TA. Splicing in disease: Disruption of the splicing code and the decoding machinery. Nat Rev Genet. 2007;8(10):749–761. - PubMed

Publication types

Actions
Actions
Actions

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions

Substances

Actions

Associated data

SRA/SRS354082

Grants and funding

LinkOut - more resources

Full Text Sources
Other Literature Sources
- The Lens - Patent Citations Database
- scite Smart Citations
Molecular Biology Databases
- NIAID Data Ecosystem - Find datasets on Infectious and Immune-mediated Diseases
Research Materials
- NCI CPTC Antibody Characterization Program

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

rMATS: robust and flexible detection of differential alternative splicing from replicate RNA-Seq data

Affiliations

rMATS: robust and flexible detection of differential alternative splicing from replicate RNA-Seq data

Authors

Affiliations

Abstract

Conflict of interest statement

Figures

References

Publication types

MeSH terms

Substances

Associated data

Grants and funding

LinkOut - more resources

Full Text Sources

Other Literature Sources

Molecular Biology Databases

Research Materials