Missing data and technical variability in single-cell RNA-sequencing experiments
- PMID: 29121214
- PMCID: PMC6215955
- DOI: 10.1093/biostatistics/kxx053
Abstract
Until recently, high-throughput gene expression technology, such as RNA-sequencing (RNA-seq), required hundreds of thousands of cells to produce reliable measurements. Recent technical advances permit genome-wide gene expression measurement at the single-cell level. Single-cell RNA-seq (scRNA-seq) is the most widely used of these technologies, and numerous publications are based on data produced with it. However, RNA-seq and scRNA-seq data are markedly different. In particular, unlike in RNA-seq, the majority of reported expression levels in scRNA-seq are zeros, which could be either biologically driven (genes not expressing RNA at the time of measurement) or technically driven (genes expressing RNA, but not at a sufficient level to be detected by the sequencing technology). Another difference is that the proportion of genes reporting an expression level of zero varies substantially more across single cells than across RNA-seq samples. However, it remains unclear to what extent this cell-to-cell variation is driven by technical rather than biological variation. Furthermore, while systematic errors, including batch effects, have been widely reported as a major challenge in high-throughput technologies, these issues have received minimal attention in published studies based on scRNA-seq technology. Here, we use an assessment experiment to examine data from published studies and demonstrate that systematic errors can explain a substantial percentage of observed cell-to-cell expression variability. Specifically, we present evidence that some of the reported zeros are driven by technical variation by demonstrating that scRNA-seq produces more zeros than expected and that this bias is greater for lower expressed genes. In addition, this missing data problem is exacerbated by the fact that the technical variation itself varies from cell to cell. We then show how this technical cell-to-cell variability can be confused with novel biological results. Finally, we demonstrate and discuss how batch effects and confounded experiments can intensify the problem.
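The quantity the abstract refers to, the proportion of genes reported as zero in each cell, can be illustrated with a minimal sketch. This is not the authors' analysis code: the function names and the toy count matrix below are hypothetical, and the matrix is assumed to have genes as rows and cells as columns.

```python
def zero_fraction_per_cell(counts):
    """Fraction of genes with a zero count in each cell (column)."""
    n_genes = len(counts)
    n_cells = len(counts[0])
    return [
        sum(1 for g in range(n_genes) if counts[g][c] == 0) / n_genes
        for c in range(n_cells)
    ]


def zero_fraction_per_gene(counts):
    """Fraction of cells with a zero count for each gene (row)."""
    return [sum(1 for value in row if value == 0) / len(row) for row in counts]


# Hypothetical toy data: 4 genes (rows) x 3 cells (columns).
toy_counts = [
    [5, 0, 3],
    [0, 0, 1],
    [2, 0, 0],
    [7, 4, 6],
]

per_cell = zero_fraction_per_cell(toy_counts)
per_gene = zero_fraction_per_gene(toy_counts)
print("zero fraction per cell:", per_cell)
print("zero fraction per gene:", per_gene)
```

In this toy matrix the second cell has many more zeros than the other two. Whether a difference like that reflects biology or a cell-specific technical effect is the question the abstract raises; the sketch only computes the summary and does not separate the two sources.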
Similar articles
- scNPF: an integrative framework assisted by network propagation and network fusion for preprocessing of single-cell RNA-seq data. BMC Genomics. 2019 May 8;20(1):347. doi: 10.1186/s12864-019-5747-5. PMID: 31068142. Free PMC article.
- Detection of high variability in gene expression from single-cell RNA-seq profiling. BMC Genomics. 2016 Aug 22;17 Suppl 7(Suppl 7):508. doi: 10.1186/s12864-016-2897-6. PMID: 27556924. Free PMC article.
- Normalization of Single-Cell RNA-Seq Data. Methods Mol Biol. 2021;2284:303-329. doi: 10.1007/978-1-0716-1307-8_17. PMID: 33835450.
- Single-cell RNA sequencing in breast cancer: Understanding tumor heterogeneity and paving roads to individualized therapy. Cancer Commun (Lond). 2020 Aug;40(8):329-344. doi: 10.1002/cac2.12078. Epub 2020 Jul 12. PMID: 32654419. Free PMC article. Review.
- Microfluidics Facilitates the Development of Single-Cell RNA Sequencing. Biosensors (Basel). 2022 Jun 24;12(7):450. doi: 10.3390/bios12070450. PMID: 35884253. Free PMC article. Review.
Cited by
- All roads lead to heterogeneity: The complex involvement of astrocytes and microglia in the pathogenesis of Alzheimer's disease. Front Cell Neurosci. 2022 Aug 12;16:932572. doi: 10.3389/fncel.2022.932572. eCollection 2022. PMID: 36035256. Free PMC article. Review.
- C-ziptf: stable tensor factorization for zero-inflated multi-dimensional genomics data. BMC Bioinformatics. 2024 Oct 5;25(1):323. doi: 10.1186/s12859-024-05886-4. PMID: 39369208. Free PMC article.
- Phenotyping Tumor Heterogeneity through Proteogenomics: Study Models and Challenges. Int J Mol Sci. 2024 Aug 14;25(16):8830. doi: 10.3390/ijms25168830. PMID: 39201516. Free PMC article. Review.
- Simple Tool for Rapidly Assessing the Quality of Multiplexed Single Cell Proteomics Data. J Am Soc Mass Spectrom. 2023 Dec 6;34(12):2615-2619. doi: 10.1021/jasms.3c00238. Epub 2023 Nov 22. PMID: 37991989. Free PMC article.
- Differential detection workflows for multi-sample single-cell RNA-seq data. bioRxiv [Preprint]. 2023 Dec 19:2023.12.17.572043. doi: 10.1101/2023.12.17.572043. PMID: 38187695. Free PMC article. Preprint.