Nucleic Acids Res. 2017 Jul 27;45(13):e127.
doi: 10.1093/nar/gkx456.

Gene expression variability and the analysis of large-scale RNA-seq studies with the MDSeq

Di Ran et al. Nucleic Acids Res.

Abstract

The rapidly decreasing cost of next-generation sequencing has led to the recent availability of large-scale RNA-seq data, which enables the analysis of gene expression variability in addition to gene expression means. In this paper, we present the MDSeq, a method based on the coefficient of dispersion that provides robust and computationally efficient analysis of both gene expression means and variability on RNA-seq counts. The MDSeq uses a novel reparametrization of the negative binomial distribution to provide flexible generalized linear models (GLMs) on both the mean and the dispersion. We address the challenges of analyzing large-scale RNA-seq data with several new developments that together form a comprehensive toolset: it models technical excess zeros, identifies outliers efficiently, and evaluates differential expression at biologically interesting levels. We evaluated the performance of the MDSeq on simulated data for which the ground truth is known. The results suggest that the MDSeq often outperforms current methods for the analysis of gene expression means and variability. Moreover, we applied the MDSeq to two real RNA-seq studies, in which we identified functionally relevant genes and gene pathways. Specifically, the analysis of gene expression variability with the MDSeq on the GTEx human brain tissue data identified pathways associated with common neurodegenerative disorders even when gene expression means were conserved.
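The abstract does not spell out the reparametrization. As a minimal sketch, assuming the coefficient of dispersion ϕ is the variance-to-mean ratio (so that Var(Y) = ϕμ with ϕ > 1), a mean μ and dispersion ϕ can be mapped onto the classical negative binomial size and probability parameters as shown here. The function names are hypothetical and are not part of the MDSeq software; the values μ = exp(5) and ϕ = exp(4) are taken from the simulation settings in the Figure 5 caption.

```python
import math


def nb_mean_dispersion_to_size_prob(mu, phi):
    """Map (mean, coefficient of dispersion) to the classical NB (size, prob).

    Assumption: phi = Var(Y) / E(Y) > 1, i.e. overdispersion relative to Poisson.
    """
    if mu <= 0 or phi <= 1:
        raise ValueError("need mu > 0 and phi > 1")
    size = mu / (phi - 1.0)  # classical 'r' (number of successes)
    prob = 1.0 / phi         # classical success probability 'p'
    return size, prob


def nb_log_pmf(y, mu, phi):
    """Log-probability of count y under NB with mean mu and variance phi * mu."""
    size, prob = nb_mean_dispersion_to_size_prob(mu, phi)
    return (math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
            + size * math.log(prob) + y * math.log1p(-prob))


def nb_moments(mu, phi, upper=5000):
    """Numerically recover (total mass, mean, variance) by summing the pmf."""
    probs = [math.exp(nb_log_pmf(y, mu, phi)) for y in range(upper)]
    total = sum(probs)
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return total, mean, var


if __name__ == "__main__":
    mu0, phi0 = math.exp(5), math.exp(4)
    total, mean, var = nb_moments(mu0, phi0)
    print(f"mass={total:.6f} mean={mean:.3f} var/mean={var / mean:.3f}")
```

Under this assumed parametrization the mean and the dispersion are separate parameters, which is what allows each to be given its own linear predictor in a GLM.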

Figures

Figure 1.
Type I errors in the absence of differential expression variability. There are n samples of cases and controls each and varying proportions of excess zeros s. The MDSeq, Levene's tests, and heteroscedastic regression have well controlled type I errors for moderate to large sample sizes, whereas Bartlett's and MAD tests have highly inflated type I errors. Results are based on 1,000 simulations without additional covariates. Reference lines (in red) are drawn at the 0.05 error rate.
Figure 2.
Powers of detecting differential expression variability. There are n samples of cases and controls each and varying proportions of excess zeros s and log2 fold-changes log2FC. The MDSeq often performs the best when sample sizes are moderate or large. Levene's tests and heteroscedastic regression tend to deteriorate in performance with increasing proportions of excess zeros s. Results are based on 1,000 simulations without additional covariates.
Figure 3.
Type I errors in the absence of differential expression means. There are n samples of cases and controls each and varying proportions of excess zeros s. The MDSeq controls type I errors well at moderate to large sample sizes. DESeq2 and edgeR methods may be conservative under the presence of excess zeros s > 0. Results are based on 1,000 simulations without additional covariates. Reference lines (in red) are drawn at the 0.05 error rate.
Figure 4.
Powers of detecting differential expression means. There are n samples of cases and controls each and varying proportions of excess zeros s and log2 fold-change log2FC. The MDSeq and ShrinkBayes often perform the best among methods compared under the presence of excess zeros s > 0. Results are based on 1,000 simulations without additional covariates.
Figure 5.
Hypothesis tests to evaluate absolute log fold-changes of expression means above given thresholds. Type I errors are shown where |log2FC| is less than or equal to a given threshold, and powers are shown where |log2FC| is greater than the given threshold. The MDSeq and DESeq2 have well-controlled type I errors, whereas edgeR methods have highly inflated type I errors when |log2FC| is at or moderately below the given thresholds. The MDSeq has greater power than DESeq2 when |log2FC| is moderately above the given thresholds. There are 500 samples of cases and controls each. Results are based on 1,000 simulations generated from NB(2^log2FC · μ0, ϕ0) with μ0 = exp(5) and ϕ0 = exp(4) for varying log2FC. No excess zeros were generated (s = 0). Reference lines (in gray) are drawn at the corresponding threshold levels.
Figure 6.
Accuracy of the computationally efficient one-step estimator. The one-step estimator is compared with the leave-one-out influence measure I_i under scenarios in which (A) there are no outliers and (B) outliers are present. There are n = 250 samples each for cases and controls. Counts were generated from NB(μ_i, ϕ_i) for controls and NB(2^log2FC · μ_i, ϕ_i) for cases with log2FC = 2, where μ_i = exp(5 + (x_i1 + x_i2)/2) and ϕ_i = exp(4 + (x_i1 + x_i2)/2). Additional covariates x_i1 and x_i2 were simulated from binomial distributions Binom(2, prob = (0.5, 0.5)). In (B), five samples were randomly replaced by outliers simulated from Pois(exp(5)exp(4)). Non-outlying samples (in black) and outliers (in magenta) are plotted. Reference lines (in dashed blue) are drawn at the (α_out/2)th and (1 − α_out/2)th quantiles of the variance-gamma distribution with α_out = 0.05. A diagonal reference line (in solid red) is drawn where the one-step estimator equals I_i. In (B), all five outliers were identified by the one-step estimator.
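The simulation settings named in the captions above (n cases and n controls, a log2 fold-change on the mean, and a proportion s of excess zeros) can be illustrated with a short sketch. It rests on two assumptions that the captions do not confirm: that the dispersion ϕ is the variance-to-mean ratio, and that excess zeros arise by replacing each count with zero independently with probability s. All function names are hypothetical, and this is not the authors' simulation code.

```python
import math
import random


def nb_pmf_table(mu, phi):
    """Tabulate NB probabilities with mean mu and variance phi * mu (assumed)."""
    size, prob = mu / (phi - 1.0), 1.0 / phi
    pmf, y, p = [], 0, prob ** size  # p starts at P(Y = 0)
    # Keep going past the mean until the remaining terms are negligible.
    while y <= mu or p > 1e-15:
        pmf.append(p)
        # Recurrence: P(y + 1) = P(y) * (y + size) / (y + 1) * (1 - prob)
        p *= (y + size) / (y + 1) * (1.0 - prob)
        y += 1
    return pmf


def simulate_counts(n, mu, phi, s, rng):
    """Draw n NB counts, then zero each one out with probability s."""
    pmf = nb_pmf_table(mu, phi)
    counts = rng.choices(range(len(pmf)), weights=pmf, k=n)
    return [0 if rng.random() < s else c for c in counts]


def simulate_case_control(n, mu0, phi0, log2fc, s, seed=0):
    """Controls ~ NB(mu0, phi0); cases ~ NB(2^log2fc * mu0, phi0)."""
    rng = random.Random(seed)
    controls = simulate_counts(n, mu0, phi0, s, rng)
    cases = simulate_counts(n, 2.0 ** log2fc * mu0, phi0, s, rng)
    return controls, cases


if __name__ == "__main__":
    controls, cases = simulate_case_control(
        n=500, mu0=math.exp(5), phi0=math.exp(4), log2fc=2, s=0.0, seed=1)
    print(sum(controls) / len(controls), sum(cases) / len(cases))
```

Sampling from a tabulated pmf keeps the sketch free of third-party dependencies; a numerical library's negative binomial sampler would serve equally well.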
