False signals induced by single-cell imputation
- PMID: 30906525
- PMCID: PMC6415334
- DOI: 10.12688/f1000research.16613.2
Abstract
Background: Single-cell RNA-seq is a powerful tool for measuring gene expression at the resolution of individual cells. A challenge in the analysis of these data is the large number of zero values, which represent either missing data or no expression. Several imputation approaches have been proposed to address this issue, but because they generally rely on structure inherent to the dataset under consideration, they may not provide any additional information. They are therefore limited by the information contained in the dataset and by the validity of their assumptions.
Methods: We evaluated the risk of generating false-positive or irreproducible differential expression when imputing data with six different methods. We applied each method to a variety of simulated datasets as well as to permuted real single-cell RNA-seq datasets and considered the number of false-positive gene-gene correlations and differentially expressed genes. Using matched 10X and Smart-seq2 data, we examined whether cell-type-specific markers were reproducible across datasets derived from the same tissue before and after imputation.
Results: The extent of false positives introduced by imputation varied considerably by method. Methods based on data smoothing (MAGIC, knn-smooth and dca) generated many false positives in both real and simulated data. Model-based imputation methods typically generated fewer false positives, but this varied greatly depending on the diversity of cell types in the sample. All imputation methods decreased the reproducibility of cell-type-specific markers, although this could be mitigated by selecting markers with large effect size and significance.
Conclusions: Imputation of single-cell RNA-seq data introduces circularity that can generate false-positive results. Thus, statistical tests applied to imputed data should be treated with care. Additional filtering by effect size can reduce but not fully eliminate these effects. Of the methods we considered, SAVER was the least likely to generate false or irreproducible results and should thus be favoured over alternatives if imputation is necessary.
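The null-data check described in the Methods can be illustrated with a toy sketch. It is not the authors' pipeline: the counts are drawn independently per gene (so every strong gene-gene correlation is false by construction), a plain k-nearest-neighbour average stands in for the smoothing-based imputation methods, and a fixed |r| cutoff stands in for the significance testing, which the abstract does not detail. All function names, the Poisson rate, `k = 15` and the cutoff of 0.4 are assumptions chosen for illustration.

```python
import math
import random


def poisson(rng, lam):
    """Draw one Poisson(lam) value with Knuth's method (fine for small lam)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def knn_smooth(cells, k):
    """Replace each cell by the mean of its k nearest cells (itself included).

    A deliberately simple stand-in for smoothing-based imputation; it is NOT
    the MAGIC, knn-smooth or dca algorithm.
    """
    smoothed = []
    for ci in cells:
        # Rank all cells by squared Euclidean distance to ci; ci itself is at distance 0.
        nearest = sorted(
            cells, key=lambda cj: sum((a - b) ** 2 for a, b in zip(ci, cj))
        )[:k]
        smoothed.append([sum(col) / k for col in zip(*nearest)])
    return smoothed


def count_strong_pairs(cells, cutoff):
    """Count gene pairs whose absolute Pearson correlation exceeds the cutoff."""
    genes = list(zip(*cells))  # transpose: one tuple of values per gene
    return sum(
        1
        for i in range(len(genes))
        for j in range(i + 1, len(genes))
        if abs(pearson(genes[i], genes[j])) > cutoff
    )


rng = random.Random(0)
n_cells, n_genes = 100, 50

# Null data: every gene is sampled independently, so no true correlations exist.
null_counts = [[poisson(rng, 2.0) for _ in range(n_genes)] for _ in range(n_cells)]
smoothed = knn_smooth(null_counts, 15)

raw_fp = count_strong_pairs(null_counts, 0.4)
smoothed_fp = count_strong_pairs(smoothed, 0.4)
print("strong pairs before smoothing:", raw_fp)
print("strong pairs after smoothing: ", smoothed_fp)
```

Because the genes are independent by construction, any pair that passes the cutoff in either matrix is a false positive, and comparing the two counts gives a rough sense of how much a given smoothing setting inflates them. The size of any inflation depends on `k`, the number of genes and cells, and the cutoff, so the numbers from this toy setting should not be read as estimates for any real imputation method.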
Keywords: Gene expression; Imputation; RNA-seq; Reproducibility; Type 1 errors; single-cell.
Conflict of interest statement
No competing interests were disclosed.
Similar articles
- scHinter: imputing dropout events for single-cell RNA-seq data with limited sample size. Bioinformatics. 2020 Feb 1;36(3):789-797. doi: 10.1093/bioinformatics/btz627. PMID: 31392316
- Model-based autoencoders for imputing discrete single-cell RNA-seq data. Methods. 2021 Aug;192:112-119. doi: 10.1016/j.ymeth.2020.09.010. Epub 2020 Sep 22. PMID: 32971193
- Evaluating imputation methods for single-cell RNA-seq data. BMC Bioinformatics. 2023 Jul 28;24(1):302. doi: 10.1186/s12859-023-05417-7. PMID: 37507764
- Subgroup analyses in randomised controlled trials: quantifying the risks of false-positives and false-negatives. Health Technol Assess. 2001;5(33):1-56. doi: 10.3310/hta5330. PMID: 11701102. Review.
- Retrospective analysis: reproducibility of interblastomere differences of mRNA expression in 2-cell stage mouse embryos is remarkably poor due to combinatorial mechanisms of blastomere diversification. Mol Hum Reprod. 2018 Jul 1;24(7):388-400. doi: 10.1093/molehr/gay021. PMID: 29746690
Cited by
- scRNMF: An imputation method for single-cell RNA-seq data by robust and non-negative matrix factorization. PLoS Comput Biol. 2024 Aug 8;20(8):e1012339. doi: 10.1371/journal.pcbi.1012339. eCollection 2024 Aug. PMID: 39116191
- A systematic evaluation of single-cell RNA-sequencing imputation methods. Genome Biol. 2020 Aug 27;21(1):218. doi: 10.1186/s13059-020-02132-x. PMID: 32854757
- scHumanNet: a single-cell network analysis platform for the study of cell-type specificity of disease genes. Nucleic Acids Res. 2023 Jan 25;51(2):e8. doi: 10.1093/nar/gkac1042. PMID: 36350625
- Joint learning of multiple gene networks from single-cell gene expression data. Comput Struct Biotechnol J. 2020 Sep 10;18:2583-2595. doi: 10.1016/j.csbj.2020.09.004. eCollection 2020. PMID: 33033579
- Data normalization for addressing the challenges in the analysis of single-cell transcriptomic datasets. BMC Genomics. 2024 May 6;25(1):444. doi: 10.1186/s12864-024-10364-5. PMID: 38711017. Review.