[Preprint]. 2024 Nov 12:2024.11.11.622097.
doi: 10.1101/2024.11.11.622097.

Perspectives on Codebook: sequence specificity of uncharacterized human transcription factors

Arttu Jolma et al. bioRxiv. 2024.

Abstract

We describe an effort ("Codebook") to determine the sequence specificity of 332 putative and largely uncharacterized human transcription factors (TFs), as well as 61 control TFs. Nearly 5,000 independent experiments across multiple in vitro and in vivo assays produced motifs for just over half of the putative TFs analyzed (177, or 53%), of which most are unique to a single TF. The data highlight the extensive contribution of transposable elements to TF evolution, both in cis and trans, and identify tens of thousands of conserved, base-level binding sites in the human genome. The use of multiple assays provides an unprecedented opportunity to benchmark and analyze TF sequence specificity, function, and evolution, as further explored in accompanying manuscripts. 1,421 human TFs are now associated with a DNA binding motif. Extrapolation from the Codebook benchmarking, however, suggests that many of the currently known binding motifs for well-studied TFs may inaccurately describe the TF's true sequence preferences.

Keywords: ChIP-seq; Codebook; DNA-binding specificity; GHT-SELEX; HT-SELEX; Motif; PBM; PWM; SELEX; SMiLE-seq; TF; Transcription factor.

Conflict of interest statement

Declaration of competing interests: O.F. is employed by Roche.

Figures

Figure 1. Codebook project overview.
Top, Categories of 393 TFs assayed and their associated constructs. Middle, Graphical summary of assays employed. Bottom left, Example of performance (as AUROC) of the best performing PWM for TPRX1, for each combination of experiment type – one for motif derivation (rows), and one for motif testing (columns). Bottom right, Depiction of the approval process for each individual experiment, including comparison of motifs and/or binding sites between replicates, evaluation of motifs across experiments, and motif similarity between related TFs (see Experiment evaluation by expert curation). Heatmap shows approved experiments for all 393 TFs across all experiment types.
Figure 2. Similarity of Codebook TF motifs.
Symmetric heatmap displaying the similarity between expert-curated PWMs for each pair of Codebook TFs, clustered by Pearson correlation with average linkage. The PWM similarity metric is the correlation between pairwise affinities to 200,000 random sequences of length 50, as calculated by MoSBAT. Pullouts and labels illustrate specific points in the main text.
Figure 3. Neglected DNA-binding domains.
Overview of new motifs for previously understudied TF families. A, Top, Number of DACH1 and DACH2 orthologs (union of one-to-one and one-to-many) across Ensembl v111 vertebrates and selected invertebrates. Species order reflects the Ensembl species tree. Bottom, AlphaFold3-predicted structure of the DACH1 SKI/SNO/DAC region (residues 130–390) bound to an HT-SELEX ligand sequence with a high-scoring PWM hit. B, Top, Sequence logos and sequence relationships of human C-clamp domains (*ZNF704 motif from ). Bottom, AlphaFold3-predicted structure of two full-length SLC2A4RG proteins bound to a CTOP sequence with flanking sequences (chr17:48,048,369–48,048,401), and four Zn2+ ions (grey). The remainder of each protein (beyond the C-clamp and C2H2-zf domains) is hidden for visual simplicity. C, Left, Sequence logos of human TFs that are derived from the domestication of Tigger and Pogo DNA transposon DBDs and have known DNA binding motifs. Tree is a maximum-likelihood phylogram from FastTree, using DBD sequence alignment with MAFFT L-INS-I, rooted on POGK, which is derived from an older family of Tigger-like elements. Sequence logos are Codebook-derived, except for CENPB. Right, Average per-base read count over Tigger15a TOPs in the human genome, for JRK ChIP-seq (orange) and GHT-SELEX (purple), with sequences aligned to the Tigger15a consensus sequence. JRK PWM scores at each base of the Tigger15a consensus sequence are shown in black (plus strand) and grey (minus strand).
Figure 4. Conservation of Codebook TF binding sites and association with genomic features.
A, Heatmaps of phyloP scores over the PWM hit and 50 bp flanking for TOP sites for four TFs (two controls and two Codebook TFs). Statistical test results (see main text and Methods) are indicated at right. B, Left, Donut plot displays the proportion and number of clusters of conserved TOP (CTOP) sites that overlap the genomic features indicated. Middle, Bar plot displays the mean number of individual CTOPs contained within clusters that overlap the examined genomic regions. C, A 1,420-base, CpG-island-overlapping CTOP cluster (chr12:120,368,293–120,369,713). Zoonomia 241-mammal phyloP scores and Multiz 471 Mammal alignment PhastCons Conserved Elements are shown. D, Bar plot of the frequency of TFs with CTOPs that occur most frequently in CTOP clusters that overlap CpG and non-CpG protein coding promoters, respectively. E, CTOP cluster overlapping the non-CpG promoter at chr12:57,745,278–57,745,396. F, CTOP site for the KRAB-C2H2-zf protein ZNF689, overlapping an L1ME4a located at chr16:25,403,631–25,403,717.
Figure 5. Allele-specific transcription factor binding and chromatin accessibility.
A, Scheme of the analysis: identification of allele-specific binding sites (ASBs) from Codebook ChIP-seq and GHT-SELEX data and annotation of allele-specific chromatin accessibility variants (ASVs) with the Codebook motifs. B, Distribution of PWM score (log-odds) fold changes between alleles for non-ASB SNPs, ASBs in peaks, and ASBs in TOPs. Left, 32 positive control TFs. Right, 85 Codebook TFs. P-values: Mann-Whitney U test. C, An example ASV for ZNF70, in chr12:6,763,200–6,765,850, around 1 kb upstream of the PTMS gene. Inset shows the exact location of the ASV (with A/G alleles) together with the corresponding PWM hit. Allelic read counts for three available ATAC- and DNase-seq samples are shown on the side. D, The ratio of concordant-to-discordant PWM hits for <SNP, TF> pairs for non-ASVs (red), all ASVs (yellow), ASVs overlapping with peaks (blue), and ASVs in TOPs (green). P-values: Fisher’s exact test. E, Left, Fraction of ASVs overlapping with PWM hits for four example TFs, using four different thresholds on ASV significance: all SNPs (blue), 25% FDR ASVs (yellow), 10% FDR ASVs (orange), and 5% FDR ASVs (red). Right, Fraction of ASVs at each location within the genome-wide PWM hits of the representative TFs using four thresholds (same colors as in bar plots).
Figure 6. Motif coverage of human TFs, by DBD family.
TFs are categorized into structural classes based on Lambert et al. See Table S10 for underlying information.
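
The PWM similarity metric named in the Figure 2 caption (correlation between the affinities of two PWMs over a shared set of random sequences) can be illustrated with a minimal Python sketch. This is not MoSBAT's implementation. Several choices here are assumptions made only for illustration: PWMs are lists of per-position log-odds scores in A, C, G, T order; a sequence's affinity is approximated as the sum of exp(score) over forward-strand windows; and the default uses 2,000 sequences rather than the 200,000 stated in the caption, to keep the run short. The function names are hypothetical.

```python
import math
import random

BASES = "ACGT"


def random_sequences(n, length, seed=0):
    """Generate n uniform random DNA sequences of the given length."""
    rng = random.Random(seed)
    return ["".join(rng.choice(BASES) for _ in range(length)) for _ in range(n)]


def affinity(pwm, seq):
    """Illustrative affinity proxy: sum of exp(log-odds score) over all
    forward-strand windows. Assumption for this sketch, not MoSBAT's model."""
    width = len(pwm)
    total = 0.0
    for start in range(len(seq) - width + 1):
        score = sum(pwm[i][BASES.index(seq[start + i])] for i in range(width))
        total += math.exp(score)
    return total


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def pwm_similarity(pwm_a, pwm_b, n_seqs=2000, length=50, seed=0):
    """Correlate the affinities of two PWMs over the same random sequences."""
    seqs = random_sequences(n_seqs, length, seed)
    scores_a = [affinity(pwm_a, s) for s in seqs]
    scores_b = [affinity(pwm_b, s) for s in seqs]
    return pearson(scores_a, scores_b)


# Two toy 4-bp PWMs (hypothetical values): one prefers AAAA, one prefers CCCC.
PWM_A = [[2.0, -2.0, -2.0, -2.0]] * 4
PWM_C = [[-2.0, 2.0, -2.0, -2.0]] * 4
```

Calling `pwm_similarity(PWM_A, PWM_A)` compares a motif with itself and gives a correlation of 1 up to floating-point error, while `pwm_similarity(PWM_A, PWM_C)` compares two unrelated toy motifs. A matrix of such pairwise values is the kind of input that could then be clustered, as the caption describes.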

References

    1. Lambert S.A. et al. The Human Transcription Factors. Cell 175, 598–599 (2018).
    2. Stormo G.D. & Zhao Y. Determining the specificity of protein-DNA interactions. Nat Rev Genet 11, 751–60 (2010).
    3. Stormo G.D. Consensus patterns in DNA. Methods Enzymol 183, 211–21 (1990).
    4. Schneider T.D. & Stephens R.M. Sequence logos: a new way to display consensus sequences. Nucleic Acids Res 18, 6097–100 (1990).
    5. Benos P.V., Bulyk M.L. & Stormo G.D. Additivity in protein-DNA interactions: how good an approximation is it? Nucleic Acids Res 30, 4442–51 (2002).
