Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2022 Mar;603(7901):455-463.
doi: 10.1038/s41586-022-04506-6. Epub 2022 Mar 9.

The evolution, evolvability and engineering of gene regulatory DNA

Affiliations

The evolution, evolvability and engineering of gene regulatory DNA

Eeshit Dhaval Vaishnav et al. Nature. 2022 Mar.

Abstract

Mutations in non-coding regulatory DNA sequences can alter gene expression, organismal phenotype and fitness1-3. Constructing complete fitness landscapes, in which DNA sequences are mapped to fitness, is a long-standing goal in biology, but has remained elusive because it is challenging to generalize reliably to vast sequence spaces4-6. Here we build sequence-to-expression models that capture fitness landscapes and use them to decipher principles of regulatory evolution. Using millions of randomly sampled promoter DNA sequences and their measured expression levels in the yeast Saccharomyces cerevisiae, we learn deep neural network models that generalize with excellent prediction performance, and enable sequence design for expression engineering. Using our models, we study expression divergence under genetic drift and strong-selection weak-mutation regimes to find that regulatory evolution is rapid and subject to diminishing returns epistasis; that conflicting expression objectives in different environments constrain expression adaptation; and that stabilizing selection on gene expression leads to the moderation of regulatory complexity. We present an approach for using such models to detect signatures of selection on expression from natural variation in regulatory sequences and use it to discover an instance of convergent regulatory evolution. We assess mutational robustness, finding that regulatory mutation effect sizes follow a power law, characterize regulatory evolvability, visualize promoter fitness landscapes, discover evolvability archetypes and illustrate the mutational robustness of natural regulatory sequence populations. Our work provides a general framework for designing regulatory sequences and addressing fundamental questions in regulatory evolution.

PubMed Disclaimer

Conflict of interest statement

Conflict of Interest statement

A.R. is a co-founder and equity holder of Celsius Therapeutics, an equity holder in Immunitas, and until July 31, 2020 was an S.A.B. member of ThermoFisher Scientific, Syros Pharmaceuticals, Neogene Therapeutics and Asimov. From August 1, 2020, A.R. is an employee of Genentech. The other authors declare no competing interests.

Figures

Extended Data Fig. 1:
Extended Data Fig. 1:. The convolutional sequence-to-expression model generalizes reliably and helps characterize sequence trajectories under different evolutionary regimes.
(a-d) Prediction of expression from sequence in complex (YPD) (a-b) and defined (SD-Uracil) (c-d) media. Predicted (x axis) and experimentally measured (y axis) expression for (a,c) random test sequences (sampled separately from and not overlapping with the training data) and (b,d) native yeast promoter sequences containing random single base mutations. Top left: Pearson’s r and associated two-tailed p-value. Compression of predictions in the lower left results from binning differences during cell sorting in different experiments (Supplementary Notes). e, Experimental validation of trajectories from simulations of random genetic drift. Distribution of measured (light grey) and predicted (dark gray) changes in expression in the defined media (SD-Uracil) (y axis) for the synthesized randomly-designed sequences (n=2,986) at each mutational step (x axis). Midline: median; boxes: interquartile range; whiskers: 5th and 95th percentile range. f, g, Simulation and validation of expression trajectories under SSWM in defined media (SD-Uracil). f, Distribution of predicted expression levels (y axis) in defined media at each evolutionary time step (x axis) for sequences under SSWM favoring high (red) or low (blue) expression, starting with native promoter sequences (n=5,720). Midline: median; boxes: interquartile range; whiskers: 5th and 95th percentile range. g, Experimentally-measured expression distribution in defined media (y axis) for the synthesized sequences (n=6,304 sequences; 637 trajectories) at each mutational step (x axis) from predicted mutational trajectories under SSWM, favoring high (red) or low (blue) expression. Midline: median; boxes: interquartile range; whiskers: 5th and 95th percentile range. h-o, Experimental validation of predicted expression for sequences from the random genetic drift and SSWM simulations. Experimentally measured (y axis) and predicted (x axis) expression level (l-o) or expression change from the starting sequence (h-k) in complex (h,j,l,n) or defined (i,k,m,o) media using sequences from the random genetic drift (Fig. 2c and (e); h,i,l,m here) and SSWM (Fig. 2g and (g); j,k,n,o here) validation experiments. Top left: Pearson’s r and associated two-tailed p-values.
Extended Data Fig. 2 |
Extended Data Fig. 2 |. Characterization of sequence trajectories under strong competing selection pressures using the convolutional model.
a,b, Expression is highly correlated between defined and complex media. Measured (a) and predicted (b) expression in defined (x axis) and complex (y axis) media for a set of test sequences measured in both media. Top left: Pearson’s r and associated two-tailed p-values. c, Opposing relationships between organismal fitness and URA3 expression in two environments. Measured expression (x axis, using a YFP reporter) and fitness (y axis; when used as the promoter sequence for the URA3 gene) for yeast with each of 11 promoters predicted to span a wide range of expression levels in complex media with 5-FOA (red), where higher expression of URA3 is toxic due to URA3-mediated conversion of 5-FOA to 5-fluorouracil, and in defined media lacking uracil (blue), where URA3 is required for uracil synthesis. Error bars: Standard error of the mean (n=3 replicate experiments). d-f, Competing expression objectives are slow to reach saturation. d,e, Difference in predicted expression (y axis) at each evolutionary time step (x axis) under selection to maximize (red) or minimize (blue) the difference between expression in defined and complex media, starting with either native sequences (d, as Fig. 2h, n=5,720) or random sequences (e, n=10,000). f, Distribution of predicted expression (y axis) in complex (blue) and defined (red) media at each evolutionary time step (x axis) for a starting set of random sequences (n=10,000). Midline: median; boxes: interquartile range; whiskers: 5th and 95th percentile range. g, Motifs enriched within sequences evolved for competing objectives in different environments. Top five most enriched motifs, found using DREME (Methods) within sequences computationally evolved from a starting set of random sequences to either maximize (left) or minimize (right) the difference in expression between defined and complex media, along with DREME E-values, the corresponding rank of the same motif when using native sequences as a starting point, the likely cognate TF and that TF’s known motif.
Extended Data Fig. 3 |
Extended Data Fig. 3 |. The transformer sequence-to-expression model generalizes reliably and helps characterize sequence trajectories under different evolutionary regimes.
a-d, Prediction of expression from sequence in the complex (a-b) and defined (c-d) media. Predicted (x axis) and experimentally measured (y axis) expression for (a,c) random test sequences (sampled separately from and not overlapping with the training data) and (b,d) native yeast promoter sequences containing random single base mutations. Top left: Pearson’s r and associated two-tailed p-value. Compression of predictions in the lower left results from binning differences during cell sorting in different experiments (Supplementary Notes). e, Predicted (x axis) and experimentally measured (y axis) expression in complex media (YPD) for all native yeast promoter sequences. Pearson’s r and associated two-tailed p-values are shown. f, Predicted expression divergence under random genetic drift. Distribution of the change in predicted expression (y axis) for random starting sequences (n=5,720) at each mutational step (x axis) for trajectories simulated under random genetic drift. Silver bar: differences in expression between unrelated sequences. g,h, Comparison of the distribution of measured (light grey) and transformer model predicted (dark gray) changes in expression (y axis) in complex media (g, n=2,983) and defined media (h, n=2,986) for synthesized randomly-designed sequences at each mutational step (x axis). i,j Predicted expression evolution under SSWM. Distribution of predicted expression levels (y axis) in complex media (i, n=10,322) and defined media (j, n=6,304) at each mutational step (x axis) for sequence trajectories under SSWM favoring high (red) or low (blue) expression, starting with 5,720 native promoter sequences. (f-j) Midline: median; boxes: interquartile range; whiskers: 5th and 95th percentile range. k-r, Comparison of model predicted expression for sequences synthesized previously for the random genetic drift and SSWM analyses. Experimentally measured (y axis) and transformer model predicted (x axis) expression level (o-r) or expression change from the starting sequence (k-n) in complex (k,m,o,q) or defined (l,n,p,r) media using sequences from the random genetic drift (Fig. 2c and (Extended Data Fig. 1e); k,l,o,p here) and SSWM (Fig. 2g and (Extended Data Fig. 1g); m,n,q,r here) validation experiments. Top left: Pearson’s r and associated two-tailed p-values.
Extended Data Fig. 4 |
Extended Data Fig. 4 |. Signatures of stabilizing selection on gene expression detected from regulatory DNA across natural populations.
a, Expression-altering alleles in the CDC36 promoter are attributed primarily to altered UPC2 binding. TF interaction strength (expression attributable to each TF) difference between the high and low alleles (each point is a TF) for each of two low expression alleles (allele 1: x axis; allele 2: y axis). Each low-expressing allele is compared to the high-expression allele with the most similar sequence (across all promoter sequences analyzed from the 1,011 strains; eTF,AhigheTF,Alow). b, Distribution of ECC (y axis, calculated from 1,011 S. cerevisiae genomes, top left) for S. cerevisiae genes whose orthologs have divergent (blue) or conserved (purple) expression (within Saccharomyces (left, n=4,191), Ascomycota (middle, n=4,910), or mammals (right, n=199) (as determined by cross species RNA-seq, top right). p-values: two-sided Wilcoxon rank-sum test. Midline: median; boxes: interquartile range; whiskers: 5th and 95th percentile range. c, Determination of expression change threshold for defining a “tolerated mutation” to compute mutational robustness. We used all genes with an ECC consistent with stabilizing selection (ECC>0; left), calculated the variance in predicted expression across the 1011 yeast strains for each gene, and chose the tolerable mutation threshold, ϵ, as two standard deviations of the distribution of the variance (right). ~73% of genes with ECC>0 had an expression variation lower than ϵ. d, Distribution of the effects of mutations (magnitude) on expression for all native regulatory sequences follows a power law with an exponent of 2.252. Shaded regions are equal in area.
Extended Data Fig. 5 |
Extended Data Fig. 5 |. Fitness responsivity of a gene as the total variation of its expression-to-fitness relationship FGENE curves.
Expression (x axis) and fitness (y axis) levels for different promoter variants for each select gene fit from experimental measurements by Keren et al. Fitness responsivity calculated as the total variation in each curve is noted above each panel.
Extended Data Fig. 6 |
Extended Data Fig. 6 |. Analysis of regulatory evolvability reveals sequence-encoded signatures of expression conservation from solitary sequences.
a, Selection of optimal number of archetypes. Mean-square-reconstruction error (y axis) for reconstructing the evolvability vectors from the embeddings learned by the autoencoder for an increasing number of archetypes (x axis). Red circle: optimal number of archetypes selected as prescribed by the “elbow method”. b, The archetypal embeddings learned by the autoencoder accurately capture evolvability vectors. Original (y axis) and reconstructed (x axis) expression changes (the values in the evolvability vectors) for each native sequence (none seen by the autoencoder in training). Top left: Pearson’s r and associated two-tailed p-values. c-f, Evolvability space captures regulatory sequences’ evolutionary properties. Proximity to the malleable archetype (Amalleable) (x axis) and mutational robustness (c,e y axis) or ECC (d,f y axis) for all yeast genes (e,f) or the gene for which fitness responsivity was quantified (c,d). Top right: Spearman’s ρ and associated two-sided p-value. “L”-shape of relationship in e results from the robust cleft, Amaxima, and Aminima all being distal to Amalleable (left side of plot). g, All native (S288C reference) promoter sequences (points) projected onto the archetypal evolvability space learned from random sequences; colored by their ECC. Large colored circles: evolvability archetypes. h, The proximity to the malleable archetype (x axis) and fitness responsivity (y axis) for the 80 genes with measured fitness responsivity. Top right: Spearman’s ρ and associated two-tailed p-values. Light blue error band: 95% confidence interval. i, All native (S288C reference) promoter sequences (points) projected on the evolvability space learned from random sequences; colored by their mean pairwise distance in the archetypal evolvability space between all promoter alleles across the 1,011 yeast isolates for that gene (ortholog evolvability dispersion). Large colored circles: evolvability archetypes.
Extended Data Fig. 7 |
Extended Data Fig. 7 |. Visualizing promoter fitness landscapes in sequence space.
Visualizing the fitness landscapes for the promoters of HXT3 (a), ADH1 (b), GCN4 (c), RPL3 (d), FBA1 (e), TUB3 (f), URA3 (in defined media) (g), URA3 (in complex media + 5FOA) (h). 1000 promoter sequences represented by their evolvability vectors projected onto the 2D archetypal evolvability space and colored by their associated fitness as reflected by their predicted growth rate relative to wildtype (color, Methods), estimated by first mapping sequences to expression with our model and then expression to fitness as measured and estimated previously.
Extended Data Fig. 8 |
Extended Data Fig. 8 |. In silico mutagenesis (ISM) of malleable and robust promoters.
SSWM trajectories for (a) DBP7, a malleable promoter, and (b) UTH1, a robust promoter. Each subplot shows the in silico mutagenesis effects for how expression level (color) changes when mutating each position (x axis) to each of the four bases (y axis) of each sequence (subplots) in the trajectories. The DNA sequence is indicated above each wildtype subplot (indicated with “WT” at left). Arrows indicate the mutations selected at each step, which always correspond to the mutation of maximal effect; increasing expression goes up the figure from wildtype and decreasing expression goes down. Part of the malleability of the DBP7 promoter results from an intermediate-affinity Rap1p binding site (gray bar). The first mutations in increasing- and decreasing-expression trajectories either increase or decrease (respectively) the affinity of this site. The UTH1 promoter changes gradually in expression and evolves proximal repressor binding sites to dampen expression (gray bars).
Fig. 1 |
Fig. 1 |. The evolution, evolvability, and engineering of gene regulatory DNA.
a, Project overview. b, Prediction of expression from sequence using the model. b, Predicted (x axis) and experimentally measured (y axis) expression in complex media (YPD) for native yeast promoter sequences. Pearson’s r and associated two-tailed p-values are shown; dashed line: line of best fit. c, Engineering extreme expression values beyond the range of native sequences using a genetic algorithm (GA) and the sequence-to-expression model. Normalized kernel density estimates of the distributions of measured expression levels for native yeast promoter sequences (grey), and sequences designed (by the GA) to have high (red) or low (blue) expression.
Fig. 2 |
Fig. 2 |. The evolutionary malleability of gene expression.
a-c, Expression divergence under genetic drift. a. Simulation procedure. b, Predicted expression divergence. Distribution of the change in predicted expression (y axis) for random starting sequences (n=5,720) at each mutational step (x axis) for simulated trajectories. Silver bar: expression differences between unrelated sequences. c, Experimental validation. Distribution of measured (light grey) and predicted (dark gray) changes in expression in complex media (y axis) for synthesized randomly-designed sequences (n=2,983) at each mutational step (x axis). d, Stabilizing selection on gene expression leads to moderation of regulatory complexity extremes. Regulatory complexity (y axis) of sequences from sequential mutational steps (x axis) under stabilizing selection to maintain the starting expression levels, where the regulatory interactions of starting sequences are complex (blue; n=192) or simple (orange, n=172). Right bars: regulatory complexity for native (dark gray) and random (light gray) sequences. e-g, Sequences under strong-selection weak-mutation (SSWM) can rapidly evolve to expression optima. e. Simulation procedure. f, Predicted expression evolution. Distribution of predicted expression levels (y axis) in complex media at each mutational step (x axis) for trajectories favoring high (red) or low (blue) expression, starting with native promoter sequences (n=5,720). g, Experimental validation. Measured expression distribution in complex media (y axis) for the synthesized sequences (n=10,322 sequences; 877 trajectories) at each mutational step (x axis), favoring high (red) or low (blue) expression. Axis scales differ due to variation in measurement procedure (Supplementary Information). h, Competing expression objectives constrain expression adaptation. Distribution of predicted expression (y axis) in complex (blue) and defined (red) media at each mutational step (x axis) for a starting set of native promoter sequences (n=5,720) optimizing for high expression in defined (red) and simultaneous low expression in complex (blue) media. (b-d,f-h) Midline: median; boxes: interquartile range; whiskers: 5th and 95th percentile range.
Fig. 3 |
Fig. 3 |. The Expression Conservation Coefficient (ECC) detects signatures of stabilizing selection on gene expression using natural genetic variation in regulatory DNA.
a, ECC calculation from 1,011 S. cerevisiae genomes. b, ECC distribution for S. cerevisiae genes. Frequency distribution of ECC values (x axis). Dashed line separates regions corresponding to disruptive/positive selection (left) and stabilizing selection (right). GO terms enriched by the ECC ranking are shown. Arrowhead: ECC value for the CDC36 promoter sequence. c, Convergent regulatory evolution in the CDC36 promoter. Predicted expression (x axis, left bar plot) and associated number of strains (x axis, right bar plot) of all alleles among the analyzed CDC36 promoter sequence within 1,011 yeast isolates, along with an alignment of their Upc2p binding site sequences (left; Upc2p binding motif below). Red vertical lines: two independently evolved low-expressing alleles. Grey vertical boxes: key positions in the Upc2p motif with single nucleotide polymorphisms. d, Validation of CDC36 promoter allele expression and organismal phenotype. Strains (y axis) with different Upc2p binding site alleles for both model-predicted CDC36 expression (left; predicted on −170:−90 region to capture entire Upc2p binding site), measured CDC36 expression (middle), and lag phase duration (right). Points: biological replicates (n=3); bars/vertical lines: means. Bar color: strain background. Student’s t-test p-values, unpaired, equal variance, one-sided (expression, WE WT vs. C23 p=0.044, C07 p=6.69*10−3) or two-sided (lag phase, WE WT vs. C23 p=1.34*10−4, C07 p=2*10−4); *p<0.05; **p<0.01; ***p<0.001.
Fig. 4 |
Fig. 4 |. The evolvability vector captures fitness landscapes.
a, Characterizing regulatory evolvability by computing an evolvability vector. Left and middle: Generating evolvability vectors for a sequence. Right: training an autoencoder with evolvability vectors to generate a 2D representation to visualize sequences in archetypal evolvability space. b, Evolvability archetypes discovered by the autoencoder. Left: Evolvability vectors of the rank ordered (x axis) predicted change in expression (y axis) for native sequences closest to each of the malleable (green), maxima (red) or minima (blue) archetypes and the ‘robustness cleft’ (black). Right: all native yeast (S. cerevisiae S288C) promoter sequences (grey points) projected onto the archetypal evolvability space by their evolvability vectors. Evolvability archetypes (colored circles) and their closest native sequences (s1-s4 as on left) are marked. c,d, Evolvability space captures mutational robustness and expression levels. Evolvability vectors (points) of all native yeast promoter sequences projected onto the evolvability space (archetypes are large colored circles, as in b) and colored by mutational robustness (c) or predicted expression levels (d). e, ABF1 promoter fitness landscape. Evolvability vectors of promoter sequences projected onto the evolvability space and colored by computed fitness (color, Methods). f,g, Malleable promoter sequences dynamically traverse the evolvability space. Evolvability vector projections of native sequences (points) from all 1,011 S. cerevisiae isolates. Red points: natural promoter sequence variants for DBP7, the promoter closest to the malleable archetype (f) and for UTH1, the promoter closest to the robustness cleft (g). h, The robustness of native promoter sequences. Density (color) of all native yeast promoter sequences when their evolvability vectors are projected onto the evolvability space.

Comment in

References

    1. Wittkopp PJ & Kalay G Cis-regulatory elements: molecular mechanisms and evolutionary processes underlying divergence. Nat. Rev. Genet 13, 59–69 (2011). - PubMed
    1. Hill MS, Vande Zande P & Wittkopp PJ Molecular and evolutionary processes generating variation in gene expression. Nature Reviews Genetics 1–13 (2020). - PMC - PubMed
    1. Fuqua T et al. Dense and pleiotropic regulatory information in a developmental enhancer. Nature 1–5 (2020). - PMC - PubMed
    1. de Visser JAGM & Krug J Empirical fitness landscapes and the predictability of evolution. Nat. Rev. Genet 15, 480–490 (2014). - PubMed
    1. Kondrashov DA & Kondrashov FA Topological features of rugged fitness landscapes in sequence space. Trends in Genetics 31, 24–33 (2015). - PubMed

Methods References

    1. Kosuri S et al. Composability of regulatory sequences controlling transcription and translation in Escherichia coli. Proc. Natl. Acad. Sci. U. S. A 110, 14024–14029 (2013). - PMC - PubMed
    1. Shalem O et al. Systematic dissection of the sequence determinants of gene 3’ end mediated expression control. PLoS Genet 11, e1005147 (2015). - PMC - PubMed
    1. Kinney JB, Murugan A, Callan CG Jr & Cox EC Using deep sequencing to characterize the biophysical mechanism of a transcriptional regulatory sequence. Proc. Natl. Acad. Sci. U. S. A 107, 9158–9163 (2010). - PMC - PubMed
    1. Sharon E et al. Inferring gene regulatory logic from high-throughput measurements of thousands of systematically designed promoters. Nat. Biotechnol 30, 521–530 (2012). - PMC - PubMed
    1. Melnikov A et al. Systematic dissection and optimization of inducible enhancers in human cells using a massively parallel reporter assay. Nature biotechnology 30, 271–277 (2012). - PMC - PubMed

Publication types