Identifying Differentially Expressed Genes of Zero Inflated Single Cell RNA Sequencing Data Using Mixed Model Score Tests

Zhiqiang He¹, Yueyun Pan², Fang Shao¹, Hui Wang³

Affiliations

¹ Department of Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, China.
² First Clinical Medical College, Nanjing Medical University, Nanjing, China.
³ Department of Maternal and Child Health, School of Public Health, Peking University Health Science Center, Beijing, China.

PMID: 33613638
PMCID: PMC7894898
DOI: 10.3389/fgene.2021.616686

Front Genet. 2021 Feb 5;12:616686. doi: 10.3389/fgene.2021.616686. eCollection 2021.

Abstract

Single cell RNA sequencing (scRNA-seq) allows quantitative measurement and comparison of gene expression at the resolution of single cells. Many proposed differential expression (DE) methods ignore the batch effects and zero inflation of scRNA-seq data and might therefore produce biased results. We propose a method, single cell mixed model score tests (scMMSTs), to efficiently identify DE genes in scRNA-seq data with batch effects using the generalized linear mixed model (GLMM). scMMSTs treat the batch effect as a random effect. For zero inflation, scMMSTs use a weighting strategy to calculate observational weights for counts independently under zero-inflated and zero-truncated distributions. Count data with the calculated weights are subsequently analyzed using weighted GLMMs. The theoretical null distributions of the score statistics are constructed from mixed chi-square distributions. Intensive simulations and two real datasets were used to compare edgeR-zinbwave, DESeq2-zinbwave, and scMMSTs. Our study demonstrates that scMMSTs, as a supplement to standard methods, are advantageous for identifying DE genes in zero-inflated scRNA-seq data with batch effects.
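The weighting idea described in the abstract can be illustrated with a minimal sketch. It assumes a zero-inflated negative binomial model with a known mixing probability `pi`, mean `mu`, and size parameter `theta`, and computes the posterior probability that an observed count came from the count component, which is the form of observational weight used in ZINB-WaVE-style approaches. The function names, the parameterization, and the example values are hypothetical, and this is not claimed to be the exact scMMSTs procedure, which also involves zero-truncated distributions.

```python
import math


def nb_pmf(y, mu, theta):
    """Negative binomial pmf with mean mu and size theta (assumed parameterization).

    Under this parameterization the variance is mu + mu**2 / theta.
    """
    log_p = (
        math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
        + theta * math.log(theta / (theta + mu))
        + y * math.log(mu / (theta + mu))
    )
    return math.exp(log_p)


def zinb_weight(y, mu, theta, pi):
    """Posterior probability that count y arose from the NB component.

    pi is the (assumed known) probability of an excess zero.
    """
    count_part = (1.0 - pi) * nb_pmf(y, mu, theta)
    # Only zeros can come from the excess-zero component.
    excess_zero_part = pi if y == 0 else 0.0
    return count_part / (excess_zero_part + count_part)


# Hypothetical example: a zero is down-weighted, a positive count is not.
weight_for_zero = zinb_weight(0, mu=2.0, theta=1.0, pi=0.3)
weight_for_three = zinb_weight(3, mu=2.0, theta=1.0, pi=0.3)
```

In such a scheme, positive counts always receive weight 1, while zeros receive a weight below 1 that shrinks as the excess-zero probability grows; the weights would then be passed to a weighted model fit.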

Keywords: differential expression analyses; generalized linear mixed model; observational weights; score test; single cell RNA sequencing; zero inflation.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

**FIGURE 1**
False positive rate control on simulated null Usoskin datasets and Tung datasets. **(A)** Boxplot of PCER for 30 simulated null Usoskin datasets generated by *splatter* for each of 12 DE methods. scMMSTs are marked in blue. **(B)** Histogram of uncorrected p-values for one dataset in panel A. **(C)** Boxplot of PCER for 30 simulated null Tung datasets generated by *splatter* for each of 12 DE methods. scMMSTs are marked in blue. **(D)** Histogram of uncorrected p-values for one dataset in panel C. PCER, per-comparison error rate; DE, differential expression; scMMST, single cell mixed model score test.

**FIGURE 2**
FDP-TPR curves of DE methods on simulated Usoskin datasets and Tung datasets. **(A)** Line plot of the FDP-TPR curves for simulated Usoskin datasets generated by *splatter* for each of 12 DE methods. **(B)** Line plot of the FDP-TPR curves for simulated Tung datasets generated by *splatter* for each of 12 DE methods. Circles represent values at a 0.05 nominal FDR threshold and are filled in if the FDP (i.e., empirical FDR) is less than 0.05. DE, differential expression; TPR, true positive rate; FDP, false discovery proportion; FDR, false discovery rate.
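As a rough illustration of how a single point on such an FDP-TPR curve could be computed, the sketch below adjusts p-values and then compares the calls at a nominal threshold against the simulated truth. It assumes Benjamini-Hochberg adjustment; the function names and the toy p-values are hypothetical and are not taken from the paper.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def fdp_tpr(pvalues, is_de, threshold=0.05):
    """False discovery proportion and true positive rate at a nominal FDR."""
    adjusted = bh_adjust(pvalues)
    called = [a <= threshold for a in adjusted]
    n_called = sum(called)
    n_true = sum(is_de)
    false_pos = sum(c and not t for c, t in zip(called, is_de))
    true_pos = sum(c and t for c, t in zip(called, is_de))
    fdp = false_pos / n_called if n_called else 0.0
    tpr = true_pos / n_true if n_true else 0.0
    return fdp, tpr


# Toy example: five genes, the first two truly DE.
toy_pvalues = [0.001, 0.01, 0.02, 0.5, 0.8]
toy_truth = [True, True, False, False, False]
fdp, tpr = fdp_tpr(toy_pvalues, toy_truth)
```

Sweeping `threshold` over a grid and plotting the resulting pairs would trace out a full curve; a point with `fdp` above the nominal threshold corresponds to an unfilled circle in the figure's convention.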

**FIGURE 3**
FDP-TPR curves of DE methods on simulated datasets generated by GLMMs with μ_π = 0. **(A)** Line plot of the FDP-TPR curves for simulated datasets based on negative binomial (NB) GLMMs for each of 12 DE methods with the dispersion parameter θ = 0.5. **(B)** Line plot of the FDP-TPR curves for simulated datasets based on NB GLMMs for each of 12 DE methods with θ = 1. **(C)** Line plot of the FDP-TPR curves for simulated datasets based on NB GLMMs for each of 12 DE methods with θ = 2. **(D)** Line plot of the FDP-TPR curves for simulated datasets based on Poisson GLMMs for each of 12 DE methods with $\beta_{0} = \sigma_{\beta}^{2} = 0.01$. Circles represent values at a 0.05 nominal FDR threshold and are filled in if the FDP (i.e., empirical FDR) is less than 0.05. DE, differential expression; GLMM, generalized linear mixed model; NB, negative binomial; TPR, true positive rate; FDP, false discovery proportion; FDR, false discovery rate.

**FIGURE 4**
AUCs of DE methods for simulated datasets generated by GLMMs with μ_π = 0. Adjusted p-values are used as predictors. **(A)** Bar plot of AUCs for simulated datasets generated by NB GLMMs for each of 12 DE methods with the dispersion parameter θ = 0.5. **(B)** Bar plot of AUCs for simulated datasets generated by NB GLMMs for each of 12 DE methods with θ = 1. **(C)** Bar plot of AUCs for simulated datasets generated by NB GLMMs for each of 12 DE methods with θ = 2. **(D)** Bar plot of AUCs for simulated datasets generated by Poisson GLMMs for each of 12 DE methods. AUC, area under curve; DE, differential expression; GLMM, generalized linear mixed model; NB, negative binomial.

**FIGURE 5**
Computational times for differential expression methods on the simulated null Usoskin and Tung datasets, which were generated by *splatter*. The number of cores was set to 1 and 8 on a cluster with 24 Intel Xeon processors (Skylake, IBRS) at 2.60 GHz (2593 MHz) and 128 GB RAM.

Cited by

Challenges and best practices in omics benchmarking.
Brooks TG, Lahens NF, Mrčela A, Grant GR. Nat Rev Genet. 2024 May;25(5):326-339. doi: 10.1038/s41576-023-00679-6. Epub 2024 Jan 12. PMID: 38216661. Review.
Differential Expression Analysis of Single-Cell RNA-Seq Data: Current Statistical Approaches and Outstanding Challenges.
Das S, Rai A, Rai SN. Entropy (Basel). 2022 Jul 18;24(7):995. doi: 10.3390/e24070995. PMID: 35885218. Free PMC article. Review.
Leveraging gene correlations in single cell transcriptomic data.
Silkwood K, Dollinger E, Gervin J, Atwood S, Nie Q, Lander AD. BMC Bioinformatics. 2024 Sep 18;25(1):305. doi: 10.1186/s12859-024-05926-z. PMID: 39294560. Free PMC article.
Leveraging gene correlations in single cell transcriptomic data.
Silkwood K, Dollinger E, Gervin J, Atwood S, Nie Q, Lander AD. bioRxiv [Preprint]. 2023 Nov 1:2023.03.14.532643. doi: 10.1101/2023.03.14.532643. Update in: BMC Bioinformatics. 2024 Sep 18;25(1):305. doi: 10.1186/s12859-024-05926-z. PMID: 36993765. Free PMC article. Preprint.
