Nucleic Acids Res. 2017 Jul 27;45(13):e127.

doi: 10.1093/nar/gkx456.

Gene expression variability and the analysis of large-scale RNA-seq studies with the MDSeq

Di Ran¹, Z John Daye²

Affiliations

¹ Mel and Enid Zuckerman College of Public Health, The University of Arizona, Tucson, AZ 85724, USA.
² Independent Researcher, Raleigh, NC 27612, USA.

PMID: 28535263
PMCID: PMC5737414
DOI: 10.1093/nar/gkx456

Di Ran et al. Nucleic Acids Res. 2017.

Abstract

The rapidly decreasing cost of next-generation sequencing has led to the recent availability of large-scale RNA-seq data, which enables the analysis of gene expression variability in addition to gene expression means. In this paper, we present the MDSeq, a method based on the coefficient of dispersion that provides robust and computationally efficient analysis of both gene expression means and variability in RNA-seq counts. The MDSeq uses a novel reparametrization of the negative binomial distribution to fit flexible generalized linear models (GLMs) for both the mean and the dispersion. We address the challenges of analyzing large-scale RNA-seq data through several new developments that together provide a comprehensive toolset: it models technical excess zeros, identifies outliers efficiently, and tests for differential expression at biologically interesting levels. We evaluated the performance of the MDSeq on simulated data for which the ground truth is known. The results suggest that the MDSeq often outperforms current methods for the analysis of gene expression mean and variability. Moreover, we applied the MDSeq to two real RNA-seq studies, in which we identified functionally relevant genes and gene pathways. In particular, the analysis of gene expression variability with the MDSeq on the GTEx human brain tissue data identified pathways associated with common neurodegenerative disorders where gene expression means were conserved.
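To make the mean-dispersion idea concrete, the following is a minimal sketch, not the MDSeq implementation. It assumes that the dispersion ϕ is the coefficient of dispersion (variance divided by mean), so that a count with mean μ has variance ϕμ, and that ϕ > 1 so the classical negative binomial form applies. The function names `nb_size_prob` and `nb_moments` are hypothetical, and the values exp(5) and exp(4) are taken from the simulation settings in the figure captions.

```python
import math


def nb_size_prob(mu, phi):
    """Map a mean mu and dispersion phi (assumed variance / mean) to the
    classical negative binomial (size, prob) pair.

    Assumption: phi > 1, i.e. the counts are overdispersed."""
    if mu <= 0 or phi <= 1:
        raise ValueError("need mu > 0 and phi > 1")
    prob = 1.0 / phi          # success probability
    size = mu / (phi - 1.0)   # number-of-successes parameter (may be non-integer)
    return size, prob


def nb_moments(size, prob):
    """Mean and variance of the classical negative binomial."""
    mean = size * (1.0 - prob) / prob
    var = mean / prob
    return mean, var


# Illustrative values borrowed from the figure captions.
mu0, phi0 = math.exp(5), math.exp(4)
size, prob = nb_size_prob(mu0, phi0)
mean, var = nb_moments(size, prob)
print(f"size={size:.3f}, prob={prob:.4f}, var/mean={var / mean:.3f}")
```

Under this mapping the mean depends only on μ and the variance-to-mean ratio depends only on ϕ, which is what allows separate regression models to be placed on each quantity.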

Figures

**Figure 1.**
Type I errors in the absence of differential expression variability. There are n samples of cases and controls each and varying proportions of excess zeros s. The *MDSeq*, Levene's tests, and heteroscedastic regression have well controlled type I errors for moderate to large sample sizes, whereas Bartlett's and MAD tests have highly inflated type I errors. Results are based on 1,000 simulations without additional covariates. Reference lines (in red) are drawn at the 0.05 error rate.

**Figure 2.**
Powers of detecting differential expression variability. There are n samples of cases and controls each and varying proportions of excess zeros s and log₂ fold-changes log₂FC. The *MDSeq* often performs the best when sample sizes are moderate or large. Levene's tests and heteroscedastic regression tend to deteriorate in performance with increasing proportions of excess zeros s. Results are based on 1,000 simulations without additional covariates.

**Figure 3.**
Type I errors in the absence of differential expression means. There are n samples of cases and controls each and varying proportions of excess zeros s. The *MDSeq* controls type I errors well at moderate to large sample sizes. *DESeq2* and *edgeR* methods may be conservative under the presence of excess zeros s > 0. Results are based on 1,000 simulations without additional covariates. Reference lines (in red) are drawn at the 0.05 error rate.

**Figure 4.**
Powers of detecting differential expression means. There are n samples of cases and controls each and varying proportions of excess zeros s and log₂ fold-change log₂FC. The *MDSeq* and *ShrinkBayes* often perform the best among methods compared under the presence of excess zeros s > 0. Results are based on 1,000 simulations without additional covariates.

**Figure 5.**
Hypothesis tests to evaluate absolute log fold-changes of expression means above given thresholds. Type I errors are shown at |log₂FC| less than or equal to a given threshold, and powers are presented when |log₂FC| is greater than the given threshold. The *MDSeq* and *DESeq2* have well controlled type I errors, whereas *edgeR* methods have highly inflated type I errors when |log₂FC| is at or moderately less than the given thresholds. The *MDSeq* has greater power than *DESeq2* when |log₂FC| is moderately above the given thresholds. There are 500 samples of cases and controls each. Results are based on 1,000 simulations generated from NB(2^(log₂FC) μ₀, ϕ₀) with μ₀ = exp(5) and ϕ₀ = exp(4) for varying log₂FC. No excess zeros were generated (s = 0). Reference lines (in gray) are drawn at the corresponding threshold levels.
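The simulation design in the Figure 5 caption can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' simulation code. It assumes that ϕ is the variance-to-mean ratio with ϕ > 1, that controls are drawn at the baseline mean μ₀, and that cases are drawn at 2^(log₂FC) μ₀. The helper names `sample_nb` and `simulate_gene` and the seed are hypothetical. Counts are drawn as a gamma-Poisson mixture using only the Python standard library.

```python
import math
import random


def sample_nb(mu, phi, rng):
    """Draw one count with mean mu and variance phi * mu (assumes phi > 1)
    as a gamma-Poisson mixture."""
    # Gamma rate with shape mu / (phi - 1) and scale (phi - 1): mean is mu.
    lam = rng.gammavariate(mu / (phi - 1.0), phi - 1.0)
    # Poisson(lam) draw: count unit-rate exponential arrivals in [0, lam].
    count, t = 0, rng.expovariate(1.0)
    while t <= lam:
        count += 1
        t += rng.expovariate(1.0)
    return count


def simulate_gene(n, log2fc, mu0=math.exp(5), phi0=math.exp(4), seed=1):
    """Simulate n controls at mean mu0 and n cases at mean 2**log2fc * mu0,
    both with dispersion phi0 and no excess zeros (s = 0)."""
    rng = random.Random(seed)
    controls = [sample_nb(mu0, phi0, rng) for _ in range(n)]
    cases = [sample_nb(2.0 ** log2fc * mu0, phi0, rng) for _ in range(n)]
    return controls, cases


controls, cases = simulate_gene(n=500, log2fc=1.0)
est = math.log2((sum(cases) / len(cases)) / (sum(controls) / len(controls)))
print(f"empirical log2 fold-change of means: {est:.2f}")
```

A threshold test of the kind described in the caption would then compare an estimate such as `est` against a chosen cutoff on |log₂FC| rather than against zero. The raw ratio of sample means used here is only a stand-in for a fitted model coefficient.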

**Figure 6.**
Accuracy of the computationally efficient one-step estimator. The one-step estimator is compared with the leave-one-out influence measure I_i under scenarios when (A) there are no outliers and when (B) outliers are present. There are n = 250 samples each for cases and controls. Counts were generated from NB(μ_i, ϕ_i) for controls and NB(2^(log₂FC) μ_i, ϕ_i) for cases with log₂FC = 2, where μ_i = exp(5 + (x_i1 + x_i2)/2) and ϕ_i = exp(4 + (x_i1 + x_i2)/2). Additional covariates x_i1 and x_i2 were simulated from binomial distributions *Binom*(2, *prob* = (0.5, 0.5)). In (B), five samples were randomly replaced by outliers simulated from *Pois*(exp(5) exp(4)). Non-outlying samples (in black) and outliers (in magenta) are plotted. Reference lines (in dashed blue) are drawn at the (α_out/2)th- and (1 − α_out/2)th-quantile of the variance-gamma distribution with α_out = 0.05. A diagonal reference line (in solid red) is drawn at equality of the one-step estimator and I_i. In (B), all five outliers were identified by the one-step estimator.

Cited by

Developmental Programming: Prenatal Testosterone Excess on Liver and Muscle Coding and Noncoding RNA in Female Sheep.
Saadat N, Puttabyatappa M, Elangovan VR, Dou J, Ciarelli JN, Thompson RC, Bakulski KM, Padmanabhan V. Endocrinology. 2022 Jan 1;163(1):bqab225. doi: 10.1210/endocr/bqab225. PMID: 34718504.
clrDV: a differential variability test for RNA-Seq data based on the skew-normal distribution.
Li H, Khang TF. PeerJ. 2023 Sep 29;11:e16126. doi: 10.7717/peerj.16126. PMID: 37790621.
Robust and Adaptive Non-Parametric Tests for Detecting General Distributional Shifts in Gene Expression.
Zhou F, Aw AJ, Erdmann-Pham DD, Fischer J, Song YS. bioRxiv [Preprint]. 2025 Mar 11:2025.03.06.641952. doi: 10.1101/2025.03.06.641952. PMID: 40161649.
Coordinated analysis of exon and intron data reveals novel differential gene expression changes.
Eghbalnia HR, Wilfinger WW, Mackey K, Chomczynski P. Sci Rep. 2020 Sep 24;10(1):15669. doi: 10.1038/s41598-020-72482-w. PMID: 32973253.
Detection of genes with differential expression dispersion unravels the role of autophagy in cancer progression.
Le Priol C, Azencott CA, Gidrol X. PLoS Comput Biol. 2023 Mar 9;19(3):e1010342. doi: 10.1371/journal.pcbi.1010342. PMID: 36893104.
