BMC Syst Biol. 2018 Dec 14;12(Suppl 7):114.
doi: 10.1186/s12918-018-0638-y.

MISC: missing imputation for single-cell RNA sequencing data

Mary Qu Yang et al. BMC Syst Biol.

Erratum in

Abstract

Background: Single-cell RNA sequencing (scRNA-seq) technology provides an effective way to study cell heterogeneity. However, due to low capture efficiency and stochastic gene expression, scRNA-seq data often contain a high percentage of missing values. It has been shown that the missing rate can reach approximately 30% even after noise reduction. To accurately recover missing values in scRNA-seq data, we need to know where the missing data are, how much data is missing, and what the values of these data are.

Methods: To solve these three problems, we propose a novel model based on a hybrid machine learning method, namely missing imputation for single-cell RNA-seq (MISC). To solve the first problem, we transformed it into a binary classification problem on the RNA-seq expression matrix. For the second problem, we took the intersection of the results from the classification, the zero-inflated model and the false negative model. Finally, we used a regression model to recover the values of the missing elements.
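
The intersection and regression steps described in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy matrix and the three masks are hypothetical, and ordinary least squares on the other genes is assumed here as a stand-in for the regression model, whose exact form the abstract does not specify.

```python
import numpy as np

def intersect_masks(*masks):
    """Keep only the entries that every detector flags as missing."""
    combined = np.ones(np.asarray(masks[0]).shape, dtype=bool)
    for mask in masks:
        combined &= np.asarray(mask, dtype=bool)
    return combined

def impute_by_regression(expr, missing):
    """Fill flagged entries gene by gene with a least-squares fit on the other genes.

    expr is a cells x genes matrix; missing is a boolean matrix of the same shape.
    Assumption: plain linear regression stands in for the paper's regression model.
    """
    expr = np.asarray(expr, dtype=float)
    filled = expr.copy()
    # Predictors: flagged entries are replaced by the mean of the gene's observed values.
    counts = (~missing).sum(axis=0)
    sums = np.where(missing, 0.0, expr).sum(axis=0)
    col_means = sums / np.maximum(counts, 1)
    predictors = np.where(missing, col_means, expr)
    for g in range(expr.shape[1]):
        rows = missing[:, g]
        if not rows.any() or rows.all():
            continue  # nothing to impute, or nothing to train on
        others = np.delete(predictors, g, axis=1)
        design = np.column_stack([others, np.ones(expr.shape[0])])
        coef, *_ = np.linalg.lstsq(design[~rows], expr[~rows, g], rcond=None)
        # Expression values cannot be negative, so predictions are clipped at zero.
        filled[rows, g] = np.clip(design[rows] @ coef, 0.0, None)
    return filled

# Toy data (hypothetical): gene 2 equals 2 * gene 0 + 1, with one dropout at cell 2.
expr = np.array([[1, 0, 3], [2, 1, 5], [3, 4, 0],
                 [4, 1, 9], [5, 5, 11], [6, 9, 13]], dtype=float)
classifier = np.zeros_like(expr, dtype=bool)
zim = np.zeros_like(expr, dtype=bool)
fnc = np.zeros_like(expr, dtype=bool)
classifier[2, 2] = zim[2, 2] = fnc[2, 2] = True   # flagged by all three detectors
classifier[0, 1] = zim[0, 1] = True               # flagged by only two, so it is kept as a true zero

missing = intersect_masks(classifier, zim, fnc)
filled = impute_by_regression(expr, missing)
```

Taking the intersection is the conservative choice: an entry is imputed only when all three detectors agree, so zeros flagged by a single model are left as observed.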

Results: We compared the raw data without imputation, the mean-smooth neighbor cell trajectory method and MISC on chronic myeloid leukemia (CML) data and on data from the primary somatosensory cortex and the hippocampal CA1 region of the mouse brain. On the CML data, MISC discovered a trajectory branch from CP-CML to BC-CML, which provides direct evidence of evolution from CP to BC stem cells. On the mouse brain data, MISC clearly divides the pyramidal CA1 cells into different branches, which is direct evidence of subpopulations within pyramidal CA1. In addition, with MISC, the oligodendrocyte cells became an independent group with an apparent boundary.

Conclusions: Our results showed that the MISC model improved cell type classification and could be instrumental in studying cellular heterogeneity. Overall, MISC is a robust missing data imputation model for single-cell RNA-seq data.

Keywords: False negative curve; Missing data; Single-cell RNA-seq; Zero-inflated model.

Conflict of interest statement

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
Flowchart of missing imputations on single-cell RNA-seq (MISC). It consists of data acquisition, problem modeling, machine learning and downstream validation. The machine learning approach includes binary classification, ensemble learning and regression
Fig. 2
Missing data imputation helps reveal CML stem cell trajectories associated with disease progression in CML. The trajectories include five types of stem cells: CP-CML in black (n = 477), normal HSCs in blue (n = 232), pre-BC samples taken while the patients presented in CP, 12 months and 3 months before transformation to myeloid and lymphoid BC, in green (n = 185), BC-CML in purple (n = 155) and K562 in red (n = 53), using the top 234 differentially expressed genes. a The single-cell RNA-seq expression trajectories analyzed on CML stem cells without data imputation. b The trajectory analysis on CML stem cells using the mean-smooth method with neighbor cells on the trajectory. c The trajectory analysis on CML stem cells using the MISC method to recover the CML data
Fig. 3
t-SNE analysis on imputed single-cell RNA-seq data reveals subpopulations of CML stem cells more clearly. The stem cell types are CP-CML in black (n = 477), normal HSCs in blue (n = 232), pre-BC samples taken while the patients presented in CP, 12 months and 3 months before transformation to myeloid and lymphoid BC, in green (n = 185), BC-CML in purple (n = 155) and K562 in red (n = 53). Red ovals focus on the group of BC-CML stem cells. a The t-SNE analysis on the CML stem cell data without missing data imputation. b The t-SNE analysis on the CML stem cell data using the mean-smooth method with neighbor cells on the trajectory. c The t-SNE analysis on CML stem cell data using the MISC method
Fig. 4
The overlap of the missing data discovered by ZIM, FNC and LLC. The red circle is the missing data discovered by the zero-inflated model (ZIM); the green circle is from the false negative curve (FNC); the blue circle is from large linear classification (LLC). LLC∩ZIM = 11,117,664, 47.6%; LLC∩FNC = 11,040,187, 47.2%; ZIM∩FNC = 11,745,190, 50.2%; LLC∩ZIM∩FNC = 5,493,856, 23.4%
Fig. 5
Missing data imputation helps recover the trajectories of the primary somatosensory cortex and the hippocampal CA1 region single-cell RNA-seq data. The trajectories include seven cell types: astrocytes-ependymal in orange (n = 224), interneurons in chartreuse (n = 290), oligodendrocytes in aqua (n = 820), pyramidal SS in pink (n = 399), endothelial-mural in khaki (n = 235), microglia in green (n = 98) and pyramidal CA1 in purple (n = 939). a The single-cell RNA-seq expression trajectory analysis on the mouse brain cells without data imputation. b The trajectory analysis on the mouse brain cells using the method of mean-smooth neighbor cells on the trajectory. c The trajectory analysis on the mouse brain cells using the MISC method to impute the data
Fig. 6
t-SNE analysis on imputed single-cell RNA-seq data reveals cell populations of the primary somatosensory cortex and the hippocampal CA1 region of mouse brain cells. The cell types are interneurons in red (n = 290), pyramidal SS in yellow (n = 399), pyramidal CA1 in blue (n = 939), oligodendrocytes in cyan (n = 820), microglia in black (n = 98), endothelial-mural in teal (n = 235) and astrocytes-ependymal in pink (n = 224). Red ovals focus on the group of oligodendrocyte cells. a The t-SNE analysis on the mouse brain cell data without missing data imputation. b The t-SNE analysis on the mouse brain cell data using the mean-smooth method with neighbor cells on the trajectory. c The t-SNE analysis on mouse brain cell data using the MISC method
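
The overlap figures in the Fig. 4 caption are pairwise and three-way intersections of the entries flagged by each method. A minimal sketch of how such counts could be tallied from three boolean masks follows. The helper name and the toy masks are hypothetical, and only counts are reported because the caption does not state the denominator behind its percentages.

```python
from itertools import combinations

import numpy as np

def overlap_counts(masks):
    """Count entries shared by each pair of masks and by all masks together.

    masks maps a method name to a boolean array; all arrays share one shape.
    """
    names = sorted(masks)
    counts = {}
    for a, b in combinations(names, 2):
        counts[f"{a}&{b}"] = int(np.sum(masks[a] & masks[b]))
    shared = np.logical_and.reduce([masks[n] for n in names])
    counts["&".join(names)] = int(shared.sum())
    return counts

# Toy masks (hypothetical), one flag per matrix entry.
toy = {
    "LLC": np.array([1, 1, 1, 0, 0, 0], dtype=bool),
    "ZIM": np.array([1, 1, 0, 1, 0, 0], dtype=bool),
    "FNC": np.array([1, 0, 1, 1, 0, 0], dtype=bool),
}
counts = overlap_counts(toy)
```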
