[Preprint]. 2025 Jul 25:2025.07.25.666784.
doi: 10.1101/2025.07.25.666784.

snATAC-Express infers Gene Expression from Prioritized Chromatin Accessibility Peaks using Machine Learning

Margaret Brown et al. bioRxiv.

Abstract

Background: Single-cell multi-omic investigation opens up new opportunities to understand mechanisms of gene regulation. Existing methods for inferring transcript abundance from chromatin accessibility fail to prioritize the most relevant peaks and tend to assume positive associations between ATAC peaks and RNA counts. We hypothesize that gene regulation can be modeled as a function of combined positive and negative interactions among peaks and that causal regulatory variants are enriched in the vicinity of the most critical peaks.

Results: We developed a machine learning pipeline that uses single-nucleus multiomic transcriptome and chromatin accessibility data to model gene expression as a function of ATAC peak intensity. Multiome data were available for 18 immune cell types from 29 donors, 19 of whom have Crohn's disease. The pipeline aggregates results from three machine learning approaches (random forest regression, XGBoost, and LightGBM) as well as linear regression to identify which ATAC peaks contribute to explaining variation among donors and cell types in pseudobulk gene expression. The cross-validated coefficient of determination was used to identify robust models, which typically explain between 5% and 40% of the variance in transcript abundance while utilizing on average 47% of the ATAC peaks, representing a significant gain in predictive accuracy. The most important peaks are enriched in GWAS variants for inflammatory bowel disease and the autoimmune disease systemic lupus erythematosus, but not for rheumatoid arthritis.

Conclusion: Atlanta Plots visualize the proportion of ATAC peaks contributing to a predictive model of gene expression as well as the proportion of variance explained by the model. Software implementing our pipeline, "snATAC-Express", is freely available on GitHub.

Keywords: GWAS enrichment; Gene regulation; chromatin accessibility; machine learning; multi-omics.

Conflict of interest statement

Competing Interests: The authors declare that they have no competing interests. RDW is the developer of JMP commercial statistical software that includes machine learning tools similar to those described here, but all code was, and can be, implemented with open-source software.

Figures

Figure 1.
Model overview. (A) Depiction of how chromatin accessibility peaks within a gene’s cis-regulatory region may or may not correlate clearly with gene expression. (B) Venn diagram of genes tagged by variant associations from GWAS for three traits: IBD, RA, and SLE. (C) Overview of the predictive modeling pipeline. Round 1 models are fit with four prediction strategies (linear regression, random forest regression, XGBoost, and LightGBM), each of which undergoes k-fold cross-validation with peaks ranked within each fold. The top 95% cumulatively important peaks are identified and used to train a second round of predictive models. (D) Ensembled results for each gene from Round 1 and Round 2 models, aggregated with and without linear regression, using all peaks mapped to the gene, peaks mapped to the gene and present in at least 10% of samples, and peaks mapped to the gene and present in at least 50% of samples. IBD, Inflammatory Bowel Disease; RA, Rheumatoid Arthritis; SLE, Systemic Lupus Erythematosus.
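The peak-selection step in panel C can be illustrated with a minimal sketch. It assumes that each peak has a single non-negative importance score (for example, averaged over folds) and that the peak whose score crosses the 95% threshold is itself retained; neither detail is stated in the caption. The function name `select_top_cumulative` and the peak names are hypothetical and are not taken from the snATAC-Express code.

```python
def select_top_cumulative(importances, threshold=0.95):
    """Return peaks, most important first, until their normalized
    importances cumulatively reach `threshold`.

    `importances` maps a peak name to a non-negative score.
    Assumption: the peak that crosses the threshold is kept.
    """
    total = sum(importances.values())
    if total <= 0:
        return []  # no informative peaks to rank
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    selected, running = [], 0.0
    for peak, score in ranked:
        selected.append(peak)
        running += score / total
        if running >= threshold:
            break
    return selected


# Hypothetical importance scores for four peaks near one gene.
example = {"peak_a": 0.6, "peak_b": 0.3, "peak_c": 0.08, "peak_d": 0.02}
kept = select_top_cumulative(example)
```

In this toy input the first two peaks account for 90% of the total importance, so a third peak is needed to pass 95% and the least important peak is dropped before the second round of training.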
Figure 2.
Machine learning predictions compared with ArchR’s Gene Scores. (A) Predicted versus true gene expression values across all models, measured by the linear least squares R² value (left) and by the R² value computed as 1-(RSS/TSS) (right). (B) Heatmaps illustrating the actual gene expression (left), predicted gene expression from Round 2 modeling of the top 95% cumulatively important peaks present in at least 10% of samples (middle), and ArchR’s predicted gene scores (right). Values are z-scored log2(CPM+1). RSS, residual sum of squares; TSS, total sum of squares; GEX, gene expression; CPM, counts per million.
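The two R² measures in panel A can differ substantially, which a small worked example makes concrete. The sketch below assumes that the "linear least squares" R² is the squared Pearson correlation between true and predicted values (equivalent to the R² of a simple linear fit of one on the other); the caption does not spell this out. The function names are illustrative only.

```python
from statistics import mean


def r2_least_squares(y_true, y_pred):
    """Squared Pearson correlation between true and predicted values.
    Assumed here to be what 'linear least squares R2' refers to."""
    mt, mp = mean(y_true), mean(y_pred)
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mt) ** 2 for t in y_true)
    var_p = sum((p - mp) ** 2 for p in y_pred)
    return cov ** 2 / (var_t * var_p)


def r2_rss_tss(y_true, y_pred):
    """Coefficient of determination computed as 1 - RSS/TSS."""
    mt = mean(y_true)
    rss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    tss = sum((t - mt) ** 2 for t in y_true)
    return 1 - rss / tss


# Predictions that are perfectly correlated with the truth but twice too large.
y_true = [1, 2, 3, 4]
y_pred = [2, 4, 6, 8]
corr_r2 = r2_least_squares(y_true, y_pred)
det_r2 = r2_rss_tss(y_true, y_pred)
```

Here the correlation-based value is 1 because the predictions are a perfect linear function of the truth, while 1-(RSS/TSS) is negative because the predictions are badly miscalibrated. The second measure therefore penalizes scale and offset errors that the first ignores.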
Figure 3.
Atlanta plots show that a subset of peaks explains gene expression. Atlanta plots of GWAS-tagged genes from IBD, SLE, and RA, grouped by pathway analysis. The gray bars represent the total number of peaks within the gene’s cis-regulatory region, and the green bars represent the number of peaks in the region used for training the gene expression prediction model. The blue bars represent the amount of gene expression variance explained, given as the ensembled R² value from k-fold cross-validation. Genes are clustered by pathway and arranged from left to right according to the ratio of the proportion of peaks included to the amount of variance explained.
Figure 4.
Assessment of Shapley values for ATAC peak contributions for CD40. (A) Dot plot showing gene expression for CD40, primarily in B cell populations. (B) Beeswarm plot depicting the Shapley values for each ATAC peak per sample, ordered by the absolute value of the mean Shapley value per peak. Peaks marked with an asterisk harbor a GWAS variant associated with SLE, IBD, and/or RA. (C) Scatter plots showing the correlation between the accessibility values of three ATAC peaks harboring GWAS variants for CD40 and their Shapley values (left) and gene expression (right). ATAC accessibility values and gene expression values are pseudobulked and expressed as log2(CPM+1). (D) Track plots for CD40, with peaks not used for training shown as grey lines and peaks used for training shown as red and green lines. Red lines indicate peaks that harbor a GWAS variant.
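The pseudobulked log2(CPM+1) values mentioned in panel C can be sketched as follows. This assumes that pseudobulking means summing raw counts over all nuclei in one donor and cell type, and that the CPM library size is the total count of that pseudobulk sample; the caption states neither. The function names and the toy counts are hypothetical.

```python
import math
from collections import Counter


def pseudobulk(cells):
    """Sum per-nucleus count dictionaries into one pseudobulk profile.
    Assumption: all nuclei passed in share a donor and cell type."""
    total = Counter()
    for counts in cells:
        total.update(counts)  # Counter.update adds counts per feature
    return dict(total)


def log2_cpm_plus_one(counts):
    """Transform raw counts to log2(CPM + 1), using the sample's
    total count as the library size (an assumption)."""
    library_size = sum(counts.values())
    return {
        feature: math.log2(c * 1e6 / library_size + 1)
        for feature, c in counts.items()
    }


# Two hypothetical nuclei with counts for two features.
nuclei = [{"CD40": 1, "other": 499999}, {"other": 500000}]
bulk = pseudobulk(nuclei)
normalized = log2_cpm_plus_one(bulk)
```

In this toy sample CD40 has one count in a library of one million, so its CPM is 1 and its transformed value is log2(2) = 1. The +1 pseudocount keeps features with zero counts at a transformed value of 0 rather than negative infinity.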
Figure 5.
Co-accessibility heatmaps. Heatmaps showing pairwise Spearman correlation coefficients between peaks. For each gene, peaks are ordered by their importance rank in predicting gene expression and divided into three tiers to determine whether the most important peaks (Tier 1), second-most important peaks (Tier 2), or least important peaks (Tier 3) tend to be most co-accessible. HLA-DRA’s Tier 1 peaks have the most co-accessibility of all three tiers. Peaks for GALC are moderately accessible across all three tiers, and peaks for BANK1 tend to be the most co-accessible in Tiers 1 and 2 only.
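The tiered comparison described above can be sketched as a mean pairwise Spearman correlation within each tier. This is an illustration under assumptions: the caption does not say how tiers are sized or how co-accessibility is summarized per tier, so the near-equal split and the within-tier mean used here are choices made for the example. All function names are hypothetical, and peaks with constant accessibility (zero variance) are not handled.

```python
from itertools import combinations
from statistics import mean


def average_ranks(values):
    """Rank values from 1, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx)
    sy = sum((b - my) ** 2 for b in ry)
    return cov / (sx * sy) ** 0.5


def split_into_tiers(ranked_peaks, n_tiers=3):
    """Split an importance-ordered peak list into near-equal tiers."""
    size, extra = divmod(len(ranked_peaks), n_tiers)
    tiers, start = [], 0
    for t in range(n_tiers):
        end = start + size + (1 if t < extra else 0)
        tiers.append(ranked_peaks[start:end])
        start = end
    return tiers


def mean_tier_coaccessibility(accessibility, tier):
    """Mean pairwise Spearman correlation among peaks in one tier."""
    pairs = list(combinations(tier, 2))
    if not pairs:
        return None  # fewer than two peaks in the tier
    return mean(spearman(accessibility[a], accessibility[b]) for a, b in pairs)


# Hypothetical accessibility across four samples, peaks ordered by importance.
acc = {"p1": [1, 2, 3, 4], "p2": [2, 4, 6, 9], "p3": [4, 3, 2, 1], "p4": [1, 3, 2, 4]}
tiers = split_into_tiers(["p1", "p2", "p3", "p4"], n_tiers=2)
tier_scores = [mean_tier_coaccessibility(acc, t) for t in tiers]
```

Comparing the per-tier scores gives a single number per tier that mirrors what the heatmaps show visually: whether highly ranked peaks move together across samples more than low-ranked ones.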
Figure 6.
GWAS variants are enriched in important peaks. Density plots show the importance rank of the ATAC peaks used for gene expression prediction that contain a GWAS variant, for all traits combined (A) and stratified by trait (B).
