R Soc Open Sci. 2024 Jan 24;11(1):231088.
doi: 10.1098/rsos.231088. eCollection 2024 Jan.

Position-specific evolution in transcription factor binding sites, and a fast likelihood calculation for the F81 model

Pavitra Selvakumar et al. R Soc Open Sci.

Abstract

Transcription factor binding sites (TFBS), like other DNA sequence, evolve via mutation and selection relating to their function. Models of nucleotide evolution describe DNA evolution via single-nucleotide mutation. A stationary vector of such a model is the long-term distribution of nucleotides, unchanging under the model. Neutrally evolving sites may have uniform stationary vectors, but one expects that sites within a TFBS instead have stationary vectors reflective of the fitness of various nucleotides at those positions. We introduce 'position-specific stationary vectors' (PSSVs), the collection of stationary vectors at each site in a TFBS locus, analogous to the position weight matrix (PWM) commonly used to describe TFBS. We infer PSSVs for human TFs using two evolutionary models (Felsenstein 1981 and Hasegawa-Kishino-Yano 1985). We find that PSSVs reflect the nucleotide distribution from PWMs, but with reduced specificity. We infer ancestral nucleotide distributions at individual positions and calculate 'conditional PSSVs' conditioned on specific choices of majority ancestral nucleotide. We find that certain ancestral nucleotides exert a strong evolutionary pressure on neighbouring sequence while others have a negligible effect. Finally, we present a fast likelihood calculation for the F81 model on moderate-sized trees that makes this approach feasible for large-scale studies along these lines.
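As a rough illustration of the stationary-vector idea in the abstract, the following Python sketch builds the transition matrix of the F81 model for one branch and checks that the stationary vector is left unchanged by it. The function name, the example vector and the branch parameter `q` (read here as the probability that no substitution event occurs along the branch) are assumptions made for this illustration; they are not taken from the paper, and this is not the authors' fast likelihood calculation.

```python
import numpy as np

NUCLEOTIDES = "ACGT"


def f81_transition_matrix(pi, q):
    """Return the 4x4 F81 matrix P[i, j] = q * delta_ij + (1 - q) * pi[j].

    pi : stationary vector over A, C, G, T (non-negative, sums to 1).
    q  : assumed branch parameter in [0, 1], the probability that no
         substitution event occurs on the branch (hypothetical name).
    """
    pi = np.asarray(pi, dtype=float)
    if pi.shape != (4,) or np.any(pi < 0) or not np.isclose(pi.sum(), 1.0):
        raise ValueError("pi must be a probability vector of length 4")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    # Row i is the distribution of the descendant nucleotide given ancestor i:
    # with probability q the ancestor is kept, otherwise a draw from pi.
    return q * np.eye(4) + (1.0 - q) * np.tile(pi, (4, 1))


# Example values, chosen only for illustration.
pi_example = np.array([0.1, 0.2, 0.3, 0.4])
P = f81_transition_matrix(pi_example, q=0.7)

# Stationarity: a site distributed as pi stays distributed as pi.
print(dict(zip(NUCLEOTIDES, pi_example @ P)))
```

Under this parametrization, two consecutive branches with parameters `q1` and `q2` compose into a single branch with parameter `q1 * q2`, which is one reason F81 likelihoods are comparatively cheap to evaluate.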

Keywords: compensatory mutation; evolution; transcription factor binding site.

Conflict of interest statement

We declare we have no competing interests.

Figures

Figure 1.
(a) Primate tree topology used in this work. (b) Mammal tree topology used in this work (branch lengths are calculated and not shown in this figure). (c) Sample tree with proximity labels on branches.
Figure 2.
The Felsenstein pruning algorithm’s running time, compared to our ‘fast-star’ algorithm. As a function of tree size, we achieve a speedup of >50× over the pruning algorithm for trees with five leaves, and >4× for trees with 10 leaves, over a range of sequence lengths. For larger trees our performance deteriorates.
Figure 3.
PWMs, PSSVs and conditional PSSVs for five transcription factors, calculated from primate data. The y-axis in all cases is the information content of the logo in bits, ranging from 0 to 2. In conditional PSSVs, the blue highlight is the ancestral nucleotide upon which the PSSV is conditioned, and the yellow highlights show the Jensen–Shannon divergence between the two different conditional PSSVs (JSD × 5 is plotted for clarity).
Figure 4.
A PWM constructed from human + orthologous sequence from four other primates is weaker than a human-only PWM, suggesting some turnover of TFBS; the PSSV is weaker than the five-primate PWM.
Figure 5.
PWMs and PSSVs for the genuine CTCF motif and three scrambled versions, from motif instances found in ChIP-seq peaks for CTCF, and from motif instances found randomly in the genome.
Figure 6.
Loss in information for PSSVs compared to PWMs, for CTCF and three scrambled versions, in ChIP-seq peaks and in random genomic sequence.
Figure 7.
PWMs, PSSVs and conditional PSSVs are shown for site matches for three scrambled versions of the CTCF motif. All are conditioned on the position corresponding to the one highlighted in figure 3, conditional PSSVs 1a and 1b.
Figure 8.
Comparison of PSSVs obtained from the HKY85 model on five primates, the F81 model on primates, and the F81 model on seven mammal species.
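
The captions for figures 3 and 6 refer to information content in bits and to the Jensen–Shannon divergence between two conditional PSSVs. The Python sketch below shows one common way to compute both for a single position. It assumes a uniform background over the four nucleotides (so that information content is 2 minus the Shannon entropy) and base-2 logarithms; the function names are hypothetical, and the paper's exact definitions may differ.

```python
import numpy as np


def information_content(p):
    """Information content in bits of one column: 2 - H(p).

    Assumes a uniform background over A, C, G, T, so the value runs from
    0 (uniform column) to 2 (a single nucleotide with probability 1).
    """
    p = np.asarray(p, dtype=float)
    nonzero = p[p > 0]  # 0 * log(0) is taken as 0
    entropy = -np.sum(nonzero * np.log2(nonzero))
    return 2.0 - entropy


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence in bits between two distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)  # mixture; positive wherever p or q is positive

    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Two made-up columns standing in for one position of two conditional PSSVs.
column_a = [0.70, 0.10, 0.10, 0.10]
column_b = [0.25, 0.25, 0.25, 0.25]
print(information_content(column_a), information_content(column_b))
print(jensen_shannon_divergence(column_a, column_b))
```

With base-2 logarithms the divergence lies between 0 (identical columns) and 1 bit (columns with no nucleotide in common), so it sits on a scale comparable to the information content it is plotted alongside.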
