J Comput Biol. 2022 May;29(5):465-482.
doi: 10.1089/cmb.2021.0403. Epub 2022 Mar 21.

Correlation Imputation for Single-Cell RNA-seq

Luqin Gan et al. J Comput Biol. 2022 May.

Abstract

Recent advances in single-cell RNA sequencing (scRNA-seq) technologies have yielded a powerful tool for measuring gene expression in individual cells. One major challenge is that scRNA-seq data usually contain a large number of zero expression values, which often impairs the effectiveness of downstream analyses. Numerous data imputation methods have been proposed to deal with these "dropout" events, but imputation is a difficult task for such high-dimensional and sparse data. Furthermore, the nature of the sparsity is debated: the zeros may be due to technological limitations or may represent actual biology. To address these challenges, we propose Single-cell RNA-seq Correlation completion by ENsemble learning and Auxiliary information (SCENA), a novel approach that imputes the correlation matrix of the data of interest instead of the data itself. SCENA obtains a gene-by-gene correlation estimate by ensembling various individual estimates, some of which are based on known auxiliary information about gene expression networks. Our approach is a reliable method that makes no assumptions about the nature of sparsity in scRNA-seq data or about the data distribution. Through extensive simulation studies and real data applications, we demonstrate that SCENA is not only superior in gene correlation estimation but also improves the accuracy and reliability of downstream analyses, including cell clustering, dimension reduction, and graphical model estimation to learn the gene expression network.
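
As a rough illustration of what ensembling several gene-by-gene correlation estimates can look like, the sketch below takes an element-wise average of candidate correlation matrices and skips entries that are missing from an individual estimate. This is only a minimal sketch under stated assumptions, not the authors' implementation: the function name `average_correlation_estimates`, the use of NaN to mark missing entries, and the toy matrices are all hypothetical.

```python
import numpy as np


def average_correlation_estimates(estimates):
    """Element-wise mean of several square correlation estimates.

    Assumption: a NaN entry means "this estimate has no value here",
    so it is skipped when averaging that entry.
    """
    stacked = np.stack([np.asarray(e, dtype=float) for e in estimates])
    counts = np.sum(~np.isnan(stacked), axis=0)
    sums = np.nansum(stacked, axis=0)
    # Entries missing from every estimate remain NaN.
    avg = np.divide(sums, counts,
                    out=np.full(sums.shape, np.nan), where=counts > 0)
    avg = (avg + avg.T) / 2.0      # enforce symmetry
    np.fill_diagonal(avg, 1.0)     # a correlation matrix has unit diagonal
    return avg


# Toy 3-gene example (hypothetical values).
est_a = np.array([[1.0, 0.2, np.nan],
                  [0.2, 1.0, 0.5],
                  [np.nan, 0.5, 1.0]])
est_b = np.array([[1.0, 0.4, 0.1],
                  [0.4, 1.0, 0.3],
                  [0.1, 0.3, 1.0]])
ensemble = average_correlation_estimates([est_a, est_b])
```

A plain average weights every estimate equally; a weighted combination would instead learn one coefficient per estimate, which this sketch does not attempt.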

Keywords: auxiliary information; clustering; correlation completion; dimension reduction; ensemble learning; graphical modeling; imputation; single-cell RNA-sequencing.

Conflict of interest statement

The authors declare they have no conflicting financial interests.

Figures

FIG. 1.
Correlation accuracy of the single estimates and their estimated SCENAridge parameters. Estimates with lower MSE/CMD consistently receive higher ridge parameters. corshrink (Lounici, 2014) has relatively low MSE and CMD with respect to the reference correlation in all three data sets, which is consistent with its higher ridge coefficients. CMD, correlation matrix distance; MSE, mean squared error.
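
The two accuracy measures named in the FIG. 1 caption can be sketched as follows. This record does not give the paper's exact definitions, so two assumptions are made here: MSE is averaged over all matrix entries, and CMD takes the commonly used form 1 - tr(AB) / (||A||_F ||B||_F). The function names and the toy matrices are hypothetical.

```python
import numpy as np


def correlation_mse(estimate, reference):
    """Mean squared error over all entries of two correlation matrices."""
    diff = np.asarray(estimate, dtype=float) - np.asarray(reference, dtype=float)
    return float(np.mean(diff ** 2))


def correlation_matrix_distance(estimate, reference):
    """Assumed CMD form: 1 - tr(A B) / (||A||_F * ||B||_F).

    The value is 0 when the matrices are equal up to scaling and grows
    toward 1 as their structures differ.
    """
    a = np.asarray(estimate, dtype=float)
    b = np.asarray(reference, dtype=float)
    denom = np.linalg.norm(a, "fro") * np.linalg.norm(b, "fro")
    return float(1.0 - np.trace(a @ b) / denom)


# Toy 2-gene example (hypothetical values).
reference = np.array([[1.0, 0.5],
                      [0.5, 1.0]])
estimate = np.eye(2)
mse = correlation_mse(estimate, reference)
cmd = correlation_matrix_distance(estimate, reference)
```

For both measures, lower values mean the estimate is closer to the reference correlation.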
FIG. 2.
Dimension reduction accuracy. Scatterplots of the top two PC scores of the cells colored by cell type. Both SCENAridge and SCENAaverage appear to recover the reference data structure better than SAVER, yielding scatterplots with a clear separation among different types of cells. PC, principal component.
FIG. 3.
Clustering performance. ARI (higher is better) of cell type grouping through hierarchical clustering after dimension reduction through PCA explaining various proportions of variance. SCENAaverage yields the best clustering performance of all methods in all data sets, and in the chu data it even outperforms the clustering obtained from the reference data. ARI, adjusted Rand index.
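
For readers unfamiliar with the ARI used in FIG. 3, the standard pair-counting formula can be sketched in a few lines of standard-library Python. The function name and the toy labels are hypothetical, and this is not the evaluation code used in the paper.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index via the standard pair-counting formula."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: partitions agree trivially

    # Pairs of cells placed together in both, in the true, and in the
    # predicted partition, respectively.
    together_both = sum(comb(c, 2)
                        for c in Counter(zip(labels_true, labels_pred)).values())
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = together_true * together_pred / total_pairs
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (together_both - expected) / (maximum - expected)


# Toy example: cluster ids differ from the cell-type labels only by renaming.
cell_types = ["a", "a", "b", "b"]
ari_renamed = adjusted_rand_index(cell_types, [1, 1, 0, 0])
ari_crossed = adjusted_rand_index(cell_types, [0, 1, 0, 1])
```

An ARI of 1 means the clustering matches the cell types exactly, values near 0 correspond to chance-level agreement, and negative values indicate agreement worse than chance.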
FIG. 4.
Genetic graph recovery. (A) F1 score (higher is better) quantifying the performance of methods at recovering the reference gene expression network of the 50 most variable genes for various numbers of edges. SCENAridge exhibits strong performance for all data sets, whereas the performance of other methods varies dramatically across data sets. (B) Gene expression network of chu data, setting the number of edges to 50.
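
The F1 score in FIG. 4(A) compares an estimated edge set with a reference edge set. A minimal sketch, assuming undirected edges given as pairs of gene names, is shown below; the function name and the gene labels are hypothetical.

```python
def edge_f1(estimated_edges, reference_edges):
    """F1 score between two undirected edge lists.

    Each edge is a pair of node names; (u, v) and (v, u) count as the same edge.
    """
    estimated = {frozenset(edge) for edge in estimated_edges}
    reference = {frozenset(edge) for edge in reference_edges}
    true_positives = len(estimated & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(estimated)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Toy example with hypothetical gene names.
reference_net = [("G1", "G2"), ("G2", "G3"), ("G3", "G4")]
estimated_net = [("G2", "G1"), ("G3", "G4"), ("G1", "G4")]
f1 = edge_f1(estimated_net, reference_net)
```

Here two of the three estimated edges appear in the reference network, so precision and recall are both 2/3 and the F1 score is 2/3 as well.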
FIG. 5.
Dimension reduction accuracy of real data applications. Scatterplots of the top two PC scores of the cells colored by cell type. Both SCENAridge and SCENAaverage yield scatterplots with a clear separation among different types of cells.
FIG. 6.
Clustering performance of real data applications. ARI of cell type grouping through hierarchical clustering after dimension reduction using PCA explaining various proportions of variance. SCENAaverage yields the best clustering performance of all methods in all three data sets.
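
The phrase "PCA explaining various proportions of variance" in the caption above can be made concrete with a small sketch. It assumes PCA is driven by the eigenvalues of a gene-by-gene correlation matrix and picks the smallest number of leading components that reaches a target proportion; whether the paper selects components exactly this way is not stated in this record, and the function name and toy matrix are hypothetical.

```python
import numpy as np


def n_components_for_variance(corr, proportion):
    """Smallest number of leading PCs explaining `proportion` of the variance."""
    if not 0.0 < proportion <= 1.0:
        raise ValueError("proportion must be in (0, 1]")
    # eigvalsh returns ascending eigenvalues of a symmetric matrix; reverse them.
    eigvals = np.linalg.eigvalsh(np.asarray(corr, dtype=float))[::-1]
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative values
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(cumulative, proportion)) + 1
    return min(k, len(eigvals))


# Toy 2-gene correlation matrix with eigenvalues 1.8 and 0.2.
toy_corr = np.array([[1.0, 0.8],
                     [0.8, 1.0]])
k_low = n_components_for_variance(toy_corr, 0.85)
k_high = n_components_for_variance(toy_corr, 0.95)
```

In the toy matrix the first component explains 90% of the variance, so one component suffices for an 85% target while a 95% target needs both.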
FIG. 7.
Gene Expression Network Estimation. (A,B,C) Gene expression network (graphical lasso; number of edges selected through EBIC) estimate based on SCENAridge correlation estimate, colored by community detected through edge betweenness score. (D,E,F) Zoomed-in communities, which are enriched in biologically meaningful KEGG terms “DNA replication,” “Cell adhesion molecules,” and “GABAergic synapse” pathways, respectively. EBIC, extended Bayesian information criterion; KEGG, Kyoto Encyclopedia of Genes and Genomes.

References

    1. Cai, T.T., and Zhang, A. 2016. Minimax rate-optimal estimation of high-dimensional covariance matrices with incomplete data. J. Multivar. Anal. 150, 55–74.
    2. Chen, C., Wu, C., Wu, L., et al. 2018. scRMD: Imputation for single cell RNA-seq data via robust matrix decomposition. bioRxiv 459404. Bioinformatics 36, 3156–3161.
    3. Chu, L.-F., Leng, N., Zhang, J., et al. 2016. Single-cell RNA-seq reveals novel regulators of human embryonic stem cell differentiation to definitive endoderm. Genome Biol. 17, 173.
    4. ENCODE Project Consortium. 2012. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74.
    5. Darmanis, S., Sloan, S.A., Zhang, Y., et al. 2015. A survey of human brain transcriptome diversity at the single cell level. Proc. Natl Acad. Sci. U. S. A. 112, 7285–7290.
