This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2024 Jul 3:2023.09.24.559168.

doi: 10.1101/2023.09.24.559168.

GET: a foundation model of transcription across human cell types

Xi Fu^{1,2}, Shentong Mo^{3,4}, Alejandro Buendia¹, Anouchka Laurent⁵, Anqi Shao⁶, Maria Del Mar Alvarez-Torres¹, Tianji Yu¹, Jimin Tan⁷, Jiayu Su¹, Romella Sagatelian¹, Adolfo A Ferrando^{6,7}, Alberto Ciccia⁸, Yanyan Lan⁹, David M Owens^{5,10}, Teresa Palomero^{5,10}, Eric P Xing^{3,4}, Raul Rabadan^{1,2}

Affiliations

¹ Department of Systems Biology, Columbia University, New York, NY, USA.
² Department of Biomedical Informatics, Columbia University, New York, NY, USA.
³ Department of Machine Learning, Carnegie Mellon University, Pittsburgh, PA, USA.
⁴ Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE.
⁵ Institute for Cancer Genetics, Columbia University, New York, NY, USA.
⁶ Department of Dermatology, Columbia University, New York, NY, USA.
⁷ Regeneron Genetics Center, Regeneron, Tarrytown, NY, USA.
⁸ Department of Genetics and Development, Columbia University, New York, NY, USA.
⁹ Institute for AI Industry Research, Tsinghua University, Beijing, China.
¹⁰ Department of Pathology & Cell Biology, Columbia University, New York, NY, USA.

PMID: 39005360
PMCID: PMC11244937
DOI: 10.1101/2023.09.24.559168

Xi Fu et al. bioRxiv. 2024.

Update in

A foundation model of transcription across human cell types.
Fu X, Mo S, Buendia A, Laurent AP, Shao A, Alvarez-Torres MDM, Yu T, Tan J, Su J, Sagatelian R, Ferrando AA, Ciccia A, Lan Y, Owens DM, Palomero T, Xing EP, Rabadan R. Nature. 2025 Jan;637(8047):965-973. doi: 10.1038/s41586-024-08391-z. Epub 2025 Jan 8. PMID: 39779852.

Abstract

Transcriptional regulation, involving the complex interplay between regulatory sequences and proteins, directs all biological processes. Computational models of transcription lack the generalizability needed to extrapolate accurately to unseen cell types and conditions. Here, we introduce GET, an interpretable foundation model designed to uncover regulatory grammars across 213 human fetal and adult cell types. Relying exclusively on chromatin accessibility data and sequence information, GET achieves experimental-level accuracy in predicting gene expression even in previously unseen cell types. GET showcases remarkable adaptability across new sequencing platforms and assays, enabling regulatory inference across a broad range of cell types and conditions, and uncovering universal and cell-type-specific transcription factor interaction networks. We evaluated its performance on prediction of regulatory activity, inference of regulatory elements and regulators, and identification of physical interactions between transcription factors. Specifically, we show that GET outperforms current models in predicting lentivirus-based massively parallel reporter assay readout with reduced input data. In fetal erythroblasts, we identify distal (>1 Mbp) regulatory regions that were missed by previous models. In B cells, we identify a lymphocyte-specific transcription factor-transcription factor interaction that explains the functional significance of a leukemia-risk predisposing germline mutation. In sum, we provide a generalizable and accurate model for transcription together with catalogs of gene regulation and transcription factor interactions, all with cell type specificity.

Conflict of interest statement

Disclosure of potential conflicts of interest: A US provisional patent with application number 63/486,855 has been filed by Columbia University on using the method developed in the manuscript to identify gene regulatory elements and altering gene regulation and expression, on which X.F. and R.R. are inventors. R.R. is a founder of Genotwin and a member of the SAB of DiaTech and Flahy. None of these activities are related to the work described in this manuscript.

Figures

**Figure 1. The GET model’s universal applicability and exceptional accuracy as a transcription foundation model.**
a. GET derives transcriptional regulatory syntax (pretrain) from chromatin accessibility data across hundreds of cell types, providing reliable predictions (finetune) of gene expression in both seen and unseen cell types. The model’s broad applicability and comprehensibility allow for zero-shot prediction of lentiMPRA measurements, extensive identification of cell-type-specific regulatory elements and upstream transcription factors, universal embeddings of regulatory information, and causal understanding of transcription factor-transcription factor interactions. b. Schematic illustration of the training scheme of GET. The input of GET is a peak (accessible region) by transcription factor (motif) matrix derived from a human single cell (sc) ATAC-seq atlas, summarizing regulatory sequence information across a genomic locus of more than 2 Mbp. Through self-supervised random masked pretraining of the input data across more than 200 cell types, GET learns transcriptional regulatory syntax (p). Finetuned on paired single cell ATAC-seq and RNA-seq data, GET learns to transform the regulatory syntax to gene expression even in leave-out cell types (**f·p**). c. Benchmark of GET prediction performance on unseen cell types (Fetal astrocyte). Each point is a gene. Color represents normalized chromatin accessibility in TSS. Gene activity is a score widely used in modern scATAC-seq analysis pipelines. Top correlated cell type is the training cell type whose observed gene expression has the largest correlation with Fetal astrocyte, in this case Fetal inhibitory neuron. Mean cell type is the mean observed gene expression across training cell types. Dashed lines represent linear fits. d. Example visualization of observed expression (top, log₁₀TPM), GET prediction (mid, log₁₀TPM) and chromatin accessibility (bottom, log₁₀CPM) of the BCL11A locus in Fetal erythroblast. Positive (negative) values represent expression on the positive (negative) strand on hg38. e. GET trained on fetal cell types generalizes to adult cell types without retraining, outperforming the most correlated cell type baseline. The X axis shows the R² score between GET prediction in adult cell types and observed expression in the most similar fetal cell types. The Y axis shows the R² score between GET prediction and observed expression in the adult cell type.
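
As an illustration of the masked pretraining setup described in panel b, the following is a minimal sketch, not the authors' implementation. It assumes that whole regions (rows of the region-by-motif matrix) are masked, that masked rows are zeroed, and that the objective is a mean squared error over masked rows only. The function names, the mask fraction, and the toy matrix are all hypothetical.

```python
import random


def mask_regions(matrix, mask_fraction=0.5, seed=0):
    """Randomly mask whole rows (regions) of a region-by-motif matrix.

    Assumption: masked rows are replaced by zeros. A trained model would
    more plausibly use a learned mask token; zeros keep the sketch simple.
    Returns the masked copy and the sorted indices of the masked rows.
    """
    rng = random.Random(seed)
    n_regions = len(matrix)
    n_masked = round(n_regions * mask_fraction)
    masked_rows = sorted(rng.sample(range(n_regions), n_masked))
    masked_set = set(masked_rows)
    masked = [
        [0.0] * len(row) if i in masked_set else list(row)
        for i, row in enumerate(matrix)
    ]
    return masked, masked_rows


def masked_mse(prediction, target, masked_rows):
    """Mean squared error computed over the masked rows only."""
    errors = [
        (p - t) ** 2
        for i in masked_rows
        for p, t in zip(prediction[i], target[i])
    ]
    return sum(errors) / len(errors) if errors else 0.0


# Toy input: 4 accessible regions x 3 motifs (values are made up).
toy_matrix = [
    [0.2, 0.0, 0.9],
    [0.5, 0.1, 0.0],
    [0.0, 0.7, 0.3],
    [0.4, 0.4, 0.4],
]
masked_matrix, masked_rows = mask_regions(toy_matrix, mask_fraction=0.5, seed=1)
# A model would predict the hidden rows from the visible ones; scoring the
# zeroed rows against the truth stands in for an untrained prediction.
baseline_loss = masked_mse(masked_matrix, toy_matrix, masked_rows)
```

Restricting the loss to masked rows means the model is scored only on what it could not see, which is the usual reason for this design in masked self-supervised training.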

**Figure 2. Transfer learning adapts GET to new platforms and measurements.**
a. Schematic illustration of transferring GET to a lymph node 10x multiome dataset. b. Finetuned GET accurately predicts expression in training and leave-out evaluation cell types. c. Schematic workflow of lentiMPRA experiments and *in silico* lentiMPRA using GET model finetuned on K562 multiome data. d. Schematic showing the application of GET in the zero-shot setting to predict gene expression from GBM patient samples (top) and finetuned on a single GBM patient sample used to predict gene expression for an extended cohort of GBM patients. e. Pearson correlation scores for GET expression prediction on GBM cells (n = 16 samples) comparing tumor, macrophages, and oligodendrocytes for zero-shot and one-shot (finetuned). f. Readout distribution of lentiMPRA (log₂RNA/DNA, top) and GET prediction (mean expression across genomic insertions, bottom) for different types of elements. Two sided Mann-Whitney U-test: Promoter vs. Peak: p < 1e-237; Peak vs. Heterochromatin: p < 1e-237; Heterochromatin vs. Control: p = 4.049e-04. ***: 1.00e-04 < p <= 1.00e-03; ****: p <= 1.00e-04. g. Benchmark of GET lentiMPRA prediction against Enformer on random subset of elements. X axes show observed lentiMPRA readout (log₂RNA/DNA). Y axes show predicted expression in log₁₀ TPM.
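
Panel e reports Pearson correlation between predicted and observed expression. For readers who want to reproduce that kind of score on their own predictions, a minimal standard-library sketch follows. The function name and the toy values are hypothetical, and nothing here reflects the authors' evaluation code.

```python
from math import sqrt


def pearson_r(predicted, observed):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(predicted) != len(observed) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0 or var_o == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_p * var_o)


# Toy per-gene values on a log scale (made up for illustration).
predicted_expression = [0.1, 0.8, 1.5, 2.2]
observed_expression = [0.0, 1.0, 1.4, 2.5]
score = pearson_r(predicted_expression, observed_expression)
```

Note that Pearson correlation measures linear association only; the R² scores in Figure 1e additionally penalize systematic offsets between prediction and observation.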

**Figure 3. The GET model identifies cell-type-specific regulators and cis-regulatory elements.**
a. Case study of identifying cis-regulatory elements (CRE) and regulators controlling a phenotype, fetal hemoglobin (HbF) level. Four genome-wide association loci (BCL11A, MYB, NFIX, and HBG2) have been subjected to genome editing in a previous study, providing the labels for GET benchmarking. Region/motif contribution score for each gene can be computed using the GET model. b. GET identifies the GATA motif in erythroid-specific enhancer that upregulates BCL11A, an HbF repressor. (Top) motif contribution score for BCL11A expression in the erythroid-specific enhancer. (Mid) gRNA enrichment score (HbFBase). Higher score means enrichment in high HbF cells, which implies these edits disturb a cis-regulatory element or regulator binding site that can upregulate BCL11A. (Bottom) single cell ATAC-seq signal and peak from Fetal erythroblast. c. Genome tracks displaying inferred cis-regulatory elements (CREs) for BCL11A and NFIX loci. Plots for HBG2 and MYB loci can be found in Supplementary Figure 4c. The loci shown are Chr2:60324394–61074394 (0.75 Mbp) and Chr19:12694852–13794852 (1.10 Mbp). From top to bottom, the tracks represent: HbFBase, showing the gRNA enrichment score from base-editing experiments; GET, showing the inferred region importance score; Enformer, showing the inferred region importance score; HyenaDNA, showing the *in silico* mutagenesis (ISM) result using the pretrained HyenaDNA language model; ABC Powerlaw, showing the Activity-by-Contact prediction using fetal erythroblast ATAC and K562 Hi-C power law; ATAC-seq data from HUDEP-2, an erythroblast cell line; ATAC-seq data from fetal erythroblast cells, used in the training of GET; and HiChIP-seq data from HUDEP-2, demonstrating chromatin interactions. d. Benchmark results comparing GET to other methods for predicting enhancer-promoter pairs, including analysis of distal (>100 kb) interactions. (Top) Erythroblast fetal hemoglobin regulating enhancer prediction. (Bottom) K562 CRISPRi enhancer target prediction. Area under precision-recall curve (AUPRC) is shown. Ablation of different GET prediction components (Jacobian, DNase, Powerlaw; see Method: Predict enhancer targets) is also shown in the plot. e. Predicted top three regulators (motifs) for BCL11A, NFIX, and HBG2. Similar sequence patterns are highlighted with color shades. f. GATA downstream targets inferred by GET (top 10% motif score) show functional enrichment in hemopoiesis. Scatterplot shows predicted gene expression (X-axis) and GATA-motif score (Y-axis) for GATA-targeted genes with predicted expression larger than 1. All transcription factors among these genes are labeled in the plot, where those involved in Hemopoiesis are highlighted in red.

**Figure 4. GET captures regulatory information across cell types and informs causal transcription factor-transcription factor interaction.**
a. Workflow to collect and visualize cross-cell-type regulatory embedding, showing a tSNE visualization of the resulting embedding space colored by Louvain clustering. b. The cross-cell-type regulatory embedding reveals cell-type specificity in transcriptional regulation. Subsampled embedding from Fetal astrocyte (blue) and two Fetal erythroblast (yellow and brown) cell types are visualized with UMAP. c. Louvain clustering of subsampled embedding in panel b. Note that cluster 2 is an astrocyte-specific cluster. d. Gene ontology enrichment of genes in cluster 2, showing astrocyte-relevant terms and astrocyte marker genes, e.g. NFIA, GLI3. e. GET motif contribution Z-score (red means higher score compared to other clusters) for each cluster. Note that cluster 2 has elevated NFI/1 and NFI/2 motifs, which correspond to the NFI family transcription factors. f. Causal discovery using the GET motif contribution matrix identifies transcription factor-transcription factor interactions. Physical interactions from the STRING database are used as a benchmark to calculate the concordance. g. Example causal neighbor graph showing interactions (edges) between motifs (nodes). Edge weights represent interaction effect size. Edge directions mark causal directions. Blue and red edge colors mark negative and positive causal effect sizes estimated by LiNGAM, respectively. Node color marks community detected on the full causal graph. In-community edges are marked by reduced saturation. h. Benchmark of concordance of inferred transcription factor-transcription factor interactions using different methods with physical interactions from the STRING database. X-axis marks different cutoffs of retained interactions as a percentile of 79,242 total possible interactions. Y-axis marks the fraction of selected interactions that are also annotated as interacting in STRING. Green line marks the random selection background. Orange line marks the result of selection using motif-motif contribution score correlation. Red line marks the causal discovery result. Shaded area marks standard error across 5 bootstraps. Green and aqua lines show results from motif colocalization, computed as correlation between motif binding vectors in accessible regions across all cell types (green) or in hepatocytes (aqua). The star marks the result from a recent mass-spectrometry-based transcription factor-transcription factor interaction atlas (0.23 Macro F1 at 1.09% recall). The round dot marks the performance of a colocalization score computed from 677 HepG2 TF ChIP-seq (0.13 Macro F1 at 5.24% recall, Method: Causal discovery of regulator interaction).
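
The benchmark in panel h can be read as a precision-at-cutoff calculation: rank candidate interactions by score, keep the top percentile, and ask what fraction of the kept pairs appear in a reference set. The sketch below illustrates that reading under stated assumptions. Ranking by absolute score, treating reference pairs as unordered, and all names and values are hypothetical choices, not details taken from the manuscript.

```python
def concordance_at_cutoffs(scores, reference, percentiles):
    """Fraction of top-ranked pairs that appear in a reference set.

    scores:      dict mapping (motif_a, motif_b) -> interaction score
    reference:   set of frozensets, each an unordered interacting pair
    percentiles: iterable of cutoffs, as percent of all scored pairs kept

    Assumption: pairs are ranked by absolute score, so strong negative
    and strong positive effects are both treated as evidence.
    """
    ranked = sorted(scores, key=lambda pair: abs(scores[pair]), reverse=True)
    results = {}
    for pct in percentiles:
        # Always keep at least one pair so the fraction is defined.
        k = max(1, round(len(ranked) * pct / 100))
        hits = sum(1 for pair in ranked[:k] if frozenset(pair) in reference)
        results[pct] = hits / k
    return results


# Toy example with placeholder motif names and made-up scores.
toy_scores = {
    ("M1", "M2"): 0.9,
    ("M1", "M3"): -0.7,
    ("M2", "M3"): 0.2,
    ("M3", "M4"): 0.1,
}
toy_reference = {frozenset(("M1", "M2")), frozenset(("M3", "M4"))}
curve = concordance_at_cutoffs(toy_scores, toy_reference, [25, 50, 100])
# curve == {25: 1.0, 50: 0.5, 100: 0.5}
```

At the 100% cutoff the value equals the base rate of reference pairs among all candidates, which is what a random-selection background would be expected to approach at every cutoff.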

**Figure 5. Structural properties of inferred transcription factor-transcription factor interactions through GET causal discovery.**
a. Catalogs of transcription factor-transcription factor interactions. Direct interactions include homodimer, intra-family heterodimer, or inter-family heterodimer. Cofactor-mediated interaction may include both cooperative and competitive binding. b. pLDDT plot for TFAP2A and ZFX, showing correspondence between high pLDDT regions and known protein domains (red rectangles). c. Predicted monomer structure of ZFX, showing DNA binding domain (DBD, grey) and intrinsically disordered region (IDR, red). d. Predicted structure of TFAP2A structured domains and ZFX IDR. Red and blue color marks negative and positive electrostatic surfaces. e. Molecular dynamics simulation of TFAP2A-ZFX IDR (red) or ZFX IDR monomer (purple). Collapsed structure in ZFX IDR monomer is highlighted in rectangle. f. Sequence logo of ZFX and TFAP2A transcription factor binding motifs. g. Co-immunoprecipitation analysis of TFAP2A and ZFX. h. pLDDT plot for EP300, highlighting TAZ1 and TAZ2 transcription factor interacting domains. Region of interest (red) and domain (green) marks annotated regions on UNIPROT. Low pLDDT regions are highlighted in gray shades. i. Prediction of structural interactions between SNAI1 N-terminal and EP300 TAZ2 domain. j. SNAI1 N-terminal and EP300 TAZ1 domain. k. RELA C-terminal and EP300 TAZ1 domain (right). Red and blue color marks negative and positive electrostatic surfaces.

**Figure 6. GET identifies a cell-type-specific transcription factor-transcription factor interaction affected by a cancer-associated germline variant.**
a. pLDDT plot for PAX5, showing three mutational hotspots (V26G, P80R, G183S/V/A) and two frameshift hotspots. Region annotations from UNIPROT are shown in the figure as “region of interest.” b. B-cell specific motif interactions of PAX/2. PAX5 is the highest expressed transcription factor with PAX/2 motif. RORA is the highest expressed transcription factor with the NR/3 motif. Color scheme follows Figure 4g. c. AlphaFold 3 predicted multimer structure of PAX5 IDR and RORA NR domain showing contacts around G183 (B: Back; F: Front). The blue-yellow surface in the back shows hydrophilicity and hydrophobicity, respectively. Blue-red strands in the front show low-high prediction confidence, respectively. d. Detection of NCOR1, NRIP1, NR3C1, and NR2C2 PAX5 interacting proteins in immunoprecipitates from PAX5 WT, PAX5 G183S and RHOA WT-BioID-expressing B-ALL REH cell line and in total protein lysates using Protein Ligation Assays. A representative experiment is shown. e. Quantification of PAX5-NR2C2 interaction in the streptavidin immunoprecipitation shown in d. f. Venn diagram of identified PAX/2 and NR/3 specific and common regulatory targets using GET gene-by-motif importance matrix. g. (Top) Enrichment analysis (−log₁₀ p-value from Fisher exact test) using B-cell associated gene sets in Shah et al. and (bottom) biological process gene ontology gene sets. Results for the PAX/2-NR/3 common genes are shown in this figure. Results for PAX/2 or NR/3 specific genes are shown in Supplementary Figure 9. h. Enrichment analysis for differentially expressed genes between PAX5 wild type vs. PAX5 loss (left) and PAX5 G183S vs. other PAX5 alterations (CNV loss, P80R). * indicates statistical significance from hypergeometric tests. Benjamini-Hochberg adjusted P-values are reported.
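
Panels g and h rely on gene-set enrichment tests with multiple-testing correction. A minimal sketch of the two standard ingredients follows: a one-sided hypergeometric tail probability (equivalent to a one-sided Fisher exact test for over-representation) and the Benjamini-Hochberg adjustment. The function names and the toy numbers are hypothetical, and whether the authors used one-sided tests is an assumption.

```python
from math import comb


def hypergeom_enrichment_p(overlap, population, set_size, draws):
    """P(X >= overlap) when `draws` genes are sampled without replacement
    from `population` genes, of which `set_size` belong to the gene set."""
    upper = min(set_size, draws)
    total = comb(population, draws)
    tail = sum(
        comb(set_size, k) * comb(population - set_size, draws - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is the
    # minimum over all equal-or-larger ranks (enforces monotonicity).
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy example: 10 genes in total, 4 in the gene set, 3 selected, 2 overlap.
p_toy = hypergeom_enrichment_p(overlap=2, population=10, set_size=4, draws=3)
# Adjust a small batch of made-up p-values from several gene sets.
adjusted_toy = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

Using exact integer binomial coefficients keeps the tail probability accurate for small gene counts; for genome-scale populations a log-space implementation would be more practical.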
