Nat Commun. 2021 Nov 4;12(1):6386.
doi: 10.1038/s41467-021-26530-2.

Chromatin-accessibility estimation from single-cell ATAC-seq data with scOpen

Zhijian Li et al. Nat Commun.

Abstract

A major drawback of single-cell ATAC-seq (scATAC-seq) is its sparsity, i.e., open chromatin regions with no reads due to loss of DNA material during the scATAC-seq protocol. Here, we propose scOpen, a computational method based on regularized non-negative matrix factorization for imputing and quantifying the open chromatin status of regulatory regions from sparse scATAC-seq experiments. We show that scOpen improves crucial downstream analysis steps of scATAC-seq data, such as clustering, visualization, prediction of cis-regulatory DNA interactions, and delineation of regulatory features. We demonstrate the power of scOpen to dissect regulatory changes in the development of fibrosis in the kidney. This analysis identifies a role for Runx1 and its target genes in promoting fibroblast-to-myofibroblast differentiation, which drives kidney fibrosis.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. scOpen and benchmarking of scATAC-seq imputation methods.
a scOpen receives as input a sparse peak-by-cell count matrix. After matrix binarization, scOpen performs TF–IDF transformation followed by NMF for dimension reduction and matrix imputation. The imputed or reduced matrix can then be given as input to scATAC-seq methods for clustering, visualization, and interpretation of regulatory features. b Memory requirements of imputation/denoising methods on benchmarking datasets. The x-axis represents the number of elements of the input matrix (number of OC regions by cells). c Same as b for running time requirements. d Boxplot showing the evaluation of imputation/denoising methods for recovering true peaks. The y-axis indicates the area under the precision-recall curve (AUPR). Methods are ranked by the mean AUPR. One and two asterisks indicate that the method is outperformed by the top-ranked method (scOpen) at significance levels of 0.05 and 0.01, respectively, at a confidence level of 0.95 (Wilcoxon Rank Sum test, paired, two-sided; n = 1224 cells for Cell lines, n = 2210 cells for Hematopoiesis, n = 765 cells for T-cells, and n = 10,032 cells for PBMC). The box plot represents the median (central line) and the first and third quartiles (box bounds). The whiskers extend to 1.5 × the interquartile range (IQR), and external dots represent outliers (data beyond 1.5 × IQR). e Barplots showing silhouette score (y-axis) for benchmarking datasets. f Barplots showing the clustering accuracy for distinct imputation methods. The y-axis indicates the mean adjusted Rand Index (ARI). Dots represent individual ARI values of distinct clustering methods. Error bars represent the standard deviation (SD) of ARI. Data are represented as mean ± SD. One and two asterisks indicate that the method is outperformed by the top-ranked method at significance levels of 0.05 and 0.01, respectively, at a confidence level of 0.95 (n = 8 independent clustering experiments, Wilcoxon Rank Sum test, paired, two-sided). Source data for Fig. 1 are provided as a Source Data file.
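The steps named in panel a (binarization, TF–IDF, then regularized NMF) can be sketched roughly as follows. This is an illustrative sketch only, not the scOpen implementation: the TF–IDF variant, the L2 penalty `lam`, the multiplicative-update solver, and the function names `tfidf` and `regularized_nmf` are all assumptions chosen for brevity, and the toy count matrix is made up.

```python
import numpy as np

def tfidf(binary):
    """TF-IDF for a peaks x cells binary matrix (one common variant, assumed here)."""
    # Term frequency: scale each cell (column) by its number of open peaks.
    col_sums = binary.sum(axis=0, keepdims=True)
    tf = binary / np.maximum(col_sums, 1)
    # Inverse document frequency: peaks open in few cells receive more weight.
    n_cells = binary.shape[1]
    row_sums = binary.sum(axis=1, keepdims=True)
    idf = np.log(1 + n_cells / np.maximum(row_sums, 1))
    return tf * idf

def regularized_nmf(X, rank, lam=0.1, n_iter=200, seed=0):
    """Approximately minimise ||X - WH||_F^2 + lam * (||W||_F^2 + ||H||_F^2)
    subject to W, H >= 0, using multiplicative updates (an assumed solver)."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], rank)) + 1e-3
    H = rng.random((rank, X.shape[1])) + 1e-3
    eps = 1e-12  # guards against division by zero
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + lam * H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + lam * W + eps)
    return W, H

# Toy peaks x cells count matrix (6 peaks, 5 cells); values are invented.
counts = np.array([
    [2, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 3, 1, 0, 0],
    [0, 0, 0, 1, 2],
    [0, 0, 0, 2, 1],
    [0, 0, 1, 1, 0],
])
binary = (counts > 0).astype(float)   # matrix binarization
X = tfidf(binary)                     # TF-IDF transformation
W, H = regularized_nmf(X, rank=2)     # factorization
imputed = W @ H                       # dense, imputed peaks x cells matrix
reduced = H                           # low-dimensional cell representation
```

In this sketch `imputed` plays the role of the imputed matrix and `reduced` the reduced matrix that the caption describes as input for downstream clustering and visualization methods.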
Fig. 2. Benchmarking of scATAC-seq clustering and downstream analysis.
a Bar plot showing an evaluation of distances estimated on distinct scATAC-seq representations with a silhouette score. b Bar plots showing the clustering accuracy (ARI) for distinct clustering pipelines. c Scatter plot comparing silhouette scores of datasets when raw (x-axis) and scOpen estimated matrices (y-axis) are provided as input for Cicero and chromVAR. Colors represent datasets and shapes represent methods. scABC is not evaluated as it does not provide a space transformation. d Same as c for clustering results (ARI) of Cicero, chromVAR, and scABC. e Precision-recall curves showing the evaluation of the predicted links on GM12878 cells using the raw and imputed matrices as input. We used data from pol-II ChIA-PET as true labels. Colors refer to methods. We reported the AUPR for the top 3 methods. f Same as e using Hi-C data as true labels. g Visualization of co-accessibility scores (y-axis) of Cicero predicted with raw and scOpen estimated matrices contrasted with scores based on RNA pol-II ChIA-PET (purple) and promoter capture Hi-C (green) around the CD79A locus (x-axis). For ChIA-PET, the log-transformed frequencies of each interaction PET cluster represent co-accessibility scores, while the negative log-transformed p-values from the CHiCAGO software indicate Hi-C scores. h Scatter plot showing single-cell accessibility scores estimated by top-performing imputation methods (according to f) for the link between peak 1 and peak 2 (supported by Hi-C data). Each dot represents a cell and color refers to density. Pearson correlation is shown in the upper-left corner. Source data for Fig. 2 are provided as a Source Data file.
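The adjusted Rand index (ARI) used to score clustering accuracy in these panels compares a predicted clustering with reference labels while correcting for chance agreement. The following is a minimal pure-Python sketch of the standard pair-counting formula; the function name `adjusted_rand_index` and the toy labels are illustrative and are not taken from the paper's code.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # zero or one item: the labelings trivially agree
    # Pairs of items placed together in both labelings (contingency-table cells).
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs of items placed together within each labeling separately.
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = together_true * together_pred / total_pairs
    max_index = 0.5 * (together_true + together_pred)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings form a single cluster
    return (together_both - expected) / (max_index - expected)

# Same partition with renamed clusters: perfect agreement.
ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Partitions that split every reference cluster evenly: worse than chance.
ari_crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

An ARI of 1 means the two partitions are identical up to relabeling, values near 0 correspond to chance-level agreement, and negative values indicate less agreement than expected by chance.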
Fig. 3. scOpen characterizes the progression of kidney fibrosis.
a ARI values (y-axis) contrasting clustering results and transferred labels using distinct dimensional reduction methods for scATAC-seq. Clustering was performed by considering only UUO kidney cells on day 0 (WT), day 2, or day 10, or the integrated data set (all days). b UMAP of the integrated UUO scATAC-seq after doublet removal with major kidney cell types: fibroblasts, descending loop of Henle and thin ascending loop of Henle (DL & TAL), macrophages (Mac), lymphoid cells (T and B cells), endothelial cells (EC), thick ascending loop of Henle (TAL), distal convoluted tubule (DCT), collecting duct-principal cells (CD-PC), intercalated cells (IC), podocytes (Pod), and proximal tubule cells (PT S1; PT S2; PT S3; Injured PT). c Proportion of cells of selected clusters in the day 0, day 2, or day 10 experiments. d Heatmap with TF activity score (z-transformed) for TFs (y-axis) and selected clusters (x-axis). We highlight TFs with decreased activity scores in injured PTs (Rxra and Hnf4a) and TFs with high activity scores in injured PTs (Batf::Jun; Smad2::Smad3) and immune cells (Creb1; Nfkb1). e Transcription factor footprints (average ATAC-seq around predicted binding sites) of Rxra, Smad2::Smad3, and Nfkb1 for selected cell types. The logo of the underlying sequences is shown below and the number of binding sites is shown in the top-left corner. f Transcription factor footprints of Rxra, Smad2::Smad3, and Nfkb1 for injured PT cells on day 0, day 2, and day 10. Source data for Fig. 3 are provided as a Source Data file.
Fig. 4. Role of Runx1 in myofibroblast differentiation.
a Diffusion map showing sub-clustering of fibroblasts. Colors refer to sub-cell-types and the arrow represents the differentiation trajectory from fibroblast to myofibroblast. Pe pericyte, Fib fibroblast, MF myofibroblast. b Line plots showing cell proportion from the day after UUO along the trajectory. c Pseudotime heatmap showing gene activity (left) and TF motif activity (right) along the trajectory. d Footprinting profiles of Runx1 and Twist2 binding sites along the trajectory. e Immuno-fluorescence (IF) staining of Runx1 (red) in PDGFRb-eGFP mouse kidney. In sham-operated mice, Runx1 staining shows a reduced intensity in PDGFRb-eGFP+ cells compared to remaining kidney cells (arrows). f Immuno-fluorescence (IF) staining of Runx1 (red) in PDGFRb-eGFP mouse kidney at 10 days after UUO as compared to sham. Arrows indicate Runx1 staining in expanding PDGFRb-eGFP+ myofibroblasts. g Quantification of Runx1 nuclear intensity in PDGFRb-eGFP+ cells in sham vs. UUO mice. Error bars represent the SD of the intensity. Data are presented as mean ± SD. Statistical significance was assessed by a two-tailed Student’s t-test with p < 0.05 being considered statistically significant (n = 3 mice). h Performance of top-performing imputation methods on the prediction of Runx1 target genes measured with AUPR. i Peak-to-gene links (top) predicted on the scOpen matrix and associated with Tgfbr1 in fibroblast cells. The height of each link represents its significance. The dashed line represents the threshold of significance (FDR = 0.001). ATAC-seq tracks (below) were generated from pseudo-bulk profiles of fibroblast/myofibroblast cells with increasing pseudotime (0–20, 20–40, 40–60, 60–80, and 80–100). Binding sites of Runx1 (B1–B4) supported by ATAC-seq footprints and overlapping peaks are highlighted at the bottom. j Scatter plot showing gene activity of Tgfbr1 and normalized peak accessibility from the raw (upper) or scOpen imputed matrix (lower) for peak-to-gene link B4. Each dot represents cells in a given pseudotime and the overall correlation is shown in the upper-left corner. Scale bars in e and f represent 50 μm. For details on statistics and reproducibility, see the “Methods” section. Source data for Fig. 4 are provided as a Source Data file.
