Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2022 Feb;602(7895):101-105.
doi: 10.1038/s41586-021-04269-6. Epub 2022 Jan 12.

Mutation bias reflects natural selection in Arabidopsis thaliana

Affiliations

Mutation bias reflects natural selection in Arabidopsis thaliana

J Grey Monroe et al. Nature. 2022 Feb.

Erratum in

  • Author Correction: Mutation bias reflects natural selection in Arabidopsis thaliana.
    Monroe JG, Srikant T, Carbonell-Bejerano P, Becker C, Lensink M, Exposito-Alonso M, Klein M, Hildebrandt J, Neumann M, Kliebenstein D, Weng ML, Imbert E, Ågren J, Rutter MT, Fenster CB, Weigel D. Monroe JG, et al. Nature. 2023 Aug;620(7973):E13. doi: 10.1038/s41586-023-06387-9. Nature. 2023. PMID: 37495701 Free PMC article. No abstract available.

Abstract

Since the first half of the twentieth century, evolutionary theory has been dominated by the idea that mutations occur randomly with respect to their consequences1. Here we test this assumption with large surveys of de novo mutations in the plant Arabidopsis thaliana. In contrast to expectations, we find that mutations occur less often in functionally constrained regions of the genome-mutation frequency is reduced by half inside gene bodies and by two-thirds in essential genes. With independent genomic mutation datasets, including from the largest Arabidopsis mutation accumulation experiment conducted to date, we demonstrate that epigenomic and physical features explain over 90% of variance in the genome-wide pattern of mutation bias surrounding genes. Observed mutation frequencies around genes in turn accurately predict patterns of genetic polymorphisms in natural Arabidopsis accessions (r = 0.96). That mutation bias is the primary force behind patterns of sequence evolution around genes in natural accessions is supported by analyses of allele frequencies. Finally, we find that genes subject to stronger purifying selection have a lower mutation rate. We conclude that epigenome-associated mutation bias2 reduces the occurrence of deleterious mutations in Arabidopsis, challenging the prevailing paradigm that mutation is a directionless force in evolution.

PubMed Disclaimer

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Fig. 1. Identifying epigenomic and other features associated with mutations in Arabidopsis.
a, Experimental design for identifying germline and somatic mutations in the main dataset. b, Relaxed purifying selection in de novo mutation calls: rates of non-synonymous (non-syn) and stop codon variants (stop) as compared with polymorphisms detected in 1,135 natural accessions from the 1001 Genomes (1001G) project and to a null model based on mutation spectra and nucleotide composition of coding sequences. Comparison of de novo mutations between genes predicted to have or not have lethal effects when mutated is also shown. P values from χ2 test; *P < 0.05. NS, not significant. c, Genome-wide distributions in gene body density, observed mutation rates and candidate predictive features in relation to transcription start sites (TSS) and transcription termination sites (TTS). Darker shading represents greater density. SNV, single-nucleotide variant; CHGm, CHHm, CGm, methylation in the CHG, CHH and CG contexts, respectively. d, Modelling approach to predict mutation probability from a range of features. ATAC-seq, assay for transposase-accessible chromatin using sequencing; AIC, Akaike information criterion. e, Predictive models and t-values of predictor variables from the generalized linear model.
Fig. 2
Fig. 2. Lower mutation rate in gene bodies.
a, Mutation probability score (predicted SNVs plus indels per base pair from models in Fig. 1e; mean ± 2 s.e.m. in grey) based on epigenomic states and mutations observed in original mutation accumulation lines. b, Observed de novo mutations from all independent mutation accumulation datasets (mean ± 2 s.e.m. in grey, bootstrapped). c, Segregating polymorphisms (SNVs plus indels, S, mean ± 2 s.e.m. in grey, bootstrapped) in 1,135 Arabidopsis accessions. d, Tajima’s D calculated from polymorphisms in Arabidopsis accessions around TSS and TTS (mean ± 2 s.e.m. in grey). Note that these TSS and TTS plots do not consider gene length or intergenic distances and that, for example, not all sequences downstream of TSSs are genic sequences, and not all sequences upstream of TSSs are intergenic sequences. Specifically, we did not distinguish between intergenic regions (or genes) longer or shorter than 3,000 bp.
Fig. 3
Fig. 3. Lower mutation probability in essential genes.
a, Variation in epigenome-derived mutation probability scores in coding sequence (CDS) among genes and gene ontology terms enriched in genes in the top (‘high-mutation probability genes’) and bottom (‘low-mutation probability genes’) deciles. Mm., macromolecular; N2, nitrogen; nucleob., nucleobase-containing compound; reg., regulation; resp., response. b, Enrichment of epigenomic and other features in coding sequences of 719 genes known to be essential from mutant analyses (mean ± 2 s.e.m.). c, Total observed mutation rate (±2 s.e.m., bootstrapped) in genes (n = 2,339) with experimentally determined functions. The bars are coloured according to relative differences in mutation rates among genes classified by function (that is, orange refers to high mutation rate and blue represents low mutation rate). P ≈ 0 for both CDS and intron mutations.
Fig. 4
Fig. 4. Adaptive reduction in deleterious mutations.
a, Correlations between epigenomic and other features, predicted and observed mutation rates, and measures of evolutionary constraint and rates of sequence evolution. Synonymous (Ps) and non-synonymous polymorphism (Pn) in natural populations, synonymous (Ds) and non-synonymous divergence (Dn) from Arabidopsis lyrata, environmental variance of gene expression (Ve expr.) and genetic variance of gene expression (Vg expr.). ‘Pred. mut.’ is the predicted mutation rate as a function of epigenomic and other features. ‘Obs. mut.’ is the observed mutation rate in genes based on de novo mutations called across all mutation accumulation datasets. ***P < 2 × 1016, **P < 0.05. b, c, Relationship between H3K4me1 (b) and estimates of evolutionary constraint and rate of sequence evolution (c) across quantiles of observed mutation rates per gene. Pearson correlation reflects raw correlation across genes. Data are visualized by mean values ± 2 s.e.m. in 50 quantiles (each quantile = 2% of genes). d, Conceptual diagrams summarizing our findings.
Extended Data Fig. 1
Extended Data Fig. 1. Workflow and quality control of de novo mutation identification.
a, Filtering pipeline. b, High-quality de novo mutations called in this study on the original mutation accumulation experiment data (107 replicate lineages of Col-0, total number = mutations from this study plus ref. ). c, Visualization of raw data and properties of high-confidence set. d, Estimated probability of mutation calls being erroneous based on alternative and total read depths in the high-confidence set and variants removed by filtering.
Extended Data Fig. 2
Extended Data Fig. 2. Summary of observed de novo mutations and distribution across original Col-0 mutation accumulation (MA) lines.
a, De novo mutations detected in genic regions (genes ± 1,000 bp) in individual MA lines. SNVs in light blue, InDels in magenta. Our investigation was focused on mutations in and around genes, so for clarity mutations elsewhere (i.e., near centromeres) are not shown. b, Distribution of mutations across genic regions per 2 Mb windows. Vertical black lines in the lower plot mark the location of genes. c, Mutation rates in lethal- and non-lethal-effect genes (n = 27,206 genes, mean ± 2 s.e.m., two-sided t-test) d, Frequencies of single nucleotide transitions and transversions. e, Distribution of frequency of specific mutations across lines. f, Number of germline and somatic mutations detected in each MA line. g, Distribution of alternative allele read depth for putative somatic mutations. h, Relationship between number of detected mutations and total sequencing depth (total number of informative reads in variant sites) in MA lines. i, Size distribution of insertions and deletions.
Extended Data Fig. 3
Extended Data Fig. 3. Sequencing depth, mappability, and false positives do not explain observed biases in distributions of natural polymorphisms or observed mutations used to predict mutation probabilities.
a, Sequencing depth around transcription start (TSS) and termination (TTS) sites in one randomly chosen mutation accumulation line. b, Mappability around TSS TTS site calculated with GenMap. c, Rates of false positive SNP and InDel calls around TSS and TTS determined from 1,000 iterations of simulated Illumina reads. d, Simulation of effect of selection on gene bodies. Selection could take the form of mutations being dominant lethal or through somatic competition of mutations with small selection coefficients. 0 = 0%, 0.01 = 1%, 0.1 = 10%, 0.2 = 20%, 0.3 = 30% of gene body mutations removed by purifying selection. 30% is estimated to be the approximate upper bound of constrained sites in gene bodies. ei, Resequencing of 10 siblings of one MA line from ref. . e, Overview of experimental design for testing the effect of sequencing depth on calling somatic mutations. f, Filtered heterozygous variants (SNVs and InDels) called in sibling #5 sequenced at ~600x depth overlap more with variants called from sibling #5 at ~60x sequencing depth than with other siblings at ~60 sequencing depth. The boxplots show the distribution of 20 iterations of sampling equal numbers of heterozygous variants (to account for differences in total number of variants called in different siblings) for each sibling sequenced at ~60x and compared to sibling 5 sequenced at ~600x. Boxplots show median with maxima and minima reflecting interquartile range (IQR), whiskers = 1.5 * IQR (n = 20 iterations). g, Frequency distribution of unfiltered heterozygous variants called in 10 siblings sequenced at ~60x depth each. Note that because these siblings are descendants of 25 generations of self-fertilization, the number of true heterozygous (inherited segregating) calls is expected to be very small compared to heterozygous variants that are chimeric somatic mutations. h, Average mappability of variants detected in different numbers of siblings out of the 10 sequenced siblings. i, Variants called independently in one sibling or, less so, in two to four siblings show signatures of mutation bias. In contrast, variants called in five or more siblings (which should more likely be false positives due to cryptic duplications or regions with poor mappability) do not show a biased distribution around TSS and TSS, with overall distribution similar to known false positives.
Extended Data Fig. 4
Extended Data Fig. 4. Variants called in additional mutation accumulation datasets.
a, Germline and somatic mutations around gene regions in mutation accumulation lines derived from eight founder genotypes. For each founder, 35-60 lines were propagated for 8-10 generations. b, The proportion of somatic variants detected in gene bodies (gene body/(gene body + upstream + downstream)) among descendants of the same founder. F- and p-values from one-way ANOVA. (n = 400 unique mutation accumulation lines, mean ± 2 s.e.m.) c, Somatic variants detected from reanalysis of 64 individual leaves from two Col-0 plants. d, Germline variants detected in a bottlenecked A. thaliana lineage following colonization of North America since ~1600 (ref. ). e, Epigenome-predicted and observed mutation rates across all datasets for InDels and SNVs, comparing gene bodies (GB) with upstream/downstream (U/D) regions. P-values from chi-squared tests.
Extended Data Fig. 5
Extended Data Fig. 5. Relationships between epigenome-predicted mutation probability, observed de novo mutations, polymorphisms in natural populations, and Tajima’s D in natural populations.
These data show the quantitative relationships apparent in Fig. 2 of the main text. Each point reflects the value in one window of 1,200 calculated windows across all 33,056 genes, in relation to genome-wide transcription start and termination sites (TSS, TTS). Error bars indicate ±2 s.e.m. confidence intervals. For epigenome-predicted mutation probability scores, each point reflects the mean ±2 s.e.m. across all genes. For observed de novo mutations, each point reflects the total number of mutations ±2 s.e.m. (bootstrapped). For polymorphisms, each point reflects the total number of variants ±2 s.e.m. (bootstrapped). For Tajima’s D, each point reflects the mean ±2 s.e.m. Tajima’s D is already predicted by existing theory to be negatively correlated with mutation rate, as regions with higher mutation rate will be enriched for newer and therefore rarer variants.
Extended Data Fig. 6
Extended Data Fig. 6. Effects of mutation rate and selection heterogeneity on polymorphisms and Tajima’s D along genes.
Simulation results from SLiM for the first 100 genes of chromosome 1 using a population of 1,000 individuals and 10,000 generations. ac, Average correlation between 200 permutations of simulated scenarios and observed patterns of variation in natural A. thaliana accessions. a, Parameter choice: Difference between mutation rate (dm) in gene bodies and intergenic space (e.g., 0.5 = 50% reduction in mutation rate) and proportion of mutations that are deleterious in the genic (gds) and intergenic (ids) regions. The parameter combinations shown in dh are highlighted with red outlines. b, Pearson correlation coefficients (ppc) comparing Tajima’s D values from each simulation to that of observed data in natural A. thaliana accession. c, Pearson correlation coefficients (pcc) comparing number of polymorphisms accumulated in each simulation to that of observed data in wild Arabidopsis accessions. dk, Examples of polymorphism (red) and Tajima’s D (blue) in relation to gene bodies (TSS, TTS) averaged from 200 permutations of a scenario approximating empirical estimates of mutation rate heterogeneity and selection heterogeneity between gene bodies and intergenic space. Parameters (see a) given for each scenario. Strong purifying selection in gene regions alone (with equal mutation rates between gene bodies and intergenic space), which also reduces levels of polymorphism in gene bodies, causes more negative Tajima’s D values in gene bodies, which is inconsistent with observed data in natural A. thaliana accessions.
Extended Data Fig. 7
Extended Data Fig. 7. Relationships between untranslated regions or introns and mutation rates.
a, Distribution of epigenomic features in genes with different numbers of exons. b, Epigenome-predicted Mutation Probability Score (MPS), rates of natural polymorphism, and Tajima’s D in genes with different numbers of exons. c, Left: comparison of Mutation Probability Score (MPS) between genes with UTRs and those lacking 5’ or 3’ UTRs. Horizontal lines mark the mean difference between genes with and without UTRs. Vertical lines mark the mean +- confidence intervals of two-sided t-tests. Center: rates of natural polymorphism in natural A. thaliana accessions. Right: Tajima’s D in natural accessions. (n = 35,526 gene models). d, Left: Pearson’s correlation coefficients for relationship between predicted mutation probabilities and the absolute length of 5’ and 3’ UTRs. Horizontal lines mark the means, and vertical lines mark the mean +- confidence intervals. Center: same for rates of natural polymorphism in natural accessions. Right: same for Tajima’s D in natural accessions. (n = 35,526 gene models). e, Left: Relationships between intron number and total intron length with predicted mutation probabilities. Points indicate mean values. Center: same for rates of natural polymorphism in natural accessions. Right: same for Tajima’s D in natural accessions. f, Results for 544 Populus trichocarpa accessions. Horizontal lines mark the means, and vertical lines mark the mean ± confidence intervals of two-sided t-tests (n = 73,013 gene models).
Extended Data Fig. 8
Extended Data Fig. 8. Epigenomic and other features and mutation rates of lethal-effect and constitutively expressed genes.
a, Enrichment of features in coding sequences of “lethal-effect” genes (n = 2,720 lethal-effect genes, mean ±2 s.e.m.). b, Total mutation rate (±2 s.e.m., bootstrapped) in lethal- and non-lethal-effect genes (n = 27,206 genes). c, Enrichment of features in coding sequences of constitutively (across all tissues) expressed genes (n = 9,957 genes, mean ± 2 s.e.m.). d, Total mutation rate (±2 s.e.m., bootstrapped) in genes binned according to the number of tissues in which they are expressed (n = 25,987 genes with tissue-specific expression data).
Extended Data Fig. 9
Extended Data Fig. 9. Predicted mutation rates and evidence of selection on natural polymorphisms versus de novo mutations across gene regions.
a, Relationship between Tajima’s D of gene bodies and coding region selection estimated by Pn/Ps (n = 21,407 genes) and b, Dn/Ds (n = 21,407 genes). c, Epigenome-predicted mutation probability in different gene features. d, Scaled residuals ((Obs-Pred)/Pred) from S ~ u. Significantly negative residuals in coding regions are consistent with purifying selection in natural populations acting on new mutations. e, Relationships between epigenome-predicted mutation probability and other estimates of constraint. Residuals between predicted mutation rate and observed mutations are positively correlated with predicted mutation rate indicating that genes subject to purifying selection are predicted to mutate less. Genes with low predicted mutation rates are also less likely to have alpha > 0, a measure of variants under positive selection. Genes with low predicted mutation rate are depleted in natural populations for non-synonymous variants that reach fixation, as measured by the Neutrality Index, and for loss-of-function variants.
Extended Data Fig. 10
Extended Data Fig. 10. Estimates of Lsegment for different regions.
a, Length of sequence space (Lsegment) reflecting different types of regions. b, Test of parameter space that satisfies population genetic theoretical predictions for the possibility for targeted hypomutation to evolve. OOM = orders of magnitude. Selection on intragenomic mutation rate variation will be effective, when Ne * u * s * du * pd * Lsegment > 1 where Ne is the effective population size, u is the mutation rate, s is the average selection coefficient on deleterious mutations, du is the degree of change in mutation rate, pd is the proportion of sites subject to purifying selection, and Lsegment is the region of the genome affected. Assuming an effective population size of ~300,000 (ref. ), a mutation rate of ~10−8 (ref. ), an average selection coefficient of 0.01 (ref. ), an order-of-magnitude reduction in mutation rate, and functionally constrained regions where 20% of sites are under selection, the total length of the sequence affected, Lsegment, would have to be at least ~200 kb, which (accounting for differences in effective population size) is similar to previous estimates in humans. For perspective, this minimum Lsegment is considerably shorter (~1.5%) than the sum of coding regions with elevated levels of H3K4me1 (top quartile is ~13 Mb, or 15% of the genome), a feature enriched in gene bodies and essential genes and associated with lower mutation rate. Thus, selection is expected to act with high efficiency on variants that cause DNA repair and protection mechanisms to preferentially target such regions.

Comment in

References

    1. Futuyma, D. J. Evolutionary Biology 2nd edn (Sinauer, 1986).
    1. Martincorena I, Luscombe NM. Non-random mutation: the evolution of targeted hypermutation and hypomutation. Bioessays. 2013;35:123–130. doi: 10.1002/bies.201200150. - DOI - PubMed
    1. Lynch M, et al. Genetic drift, selection and the evolution of the mutation rate. Nat. Rev. Genet. 2016;17:704–714. doi: 10.1038/nrg.2016.104. - DOI - PubMed
    1. Stoletzki N, Eyre-Walker A. The positive correlation between dN/dS and dS in mammals is due to runs of adjacent substitutions. Mol. Biol. Evol. 2011;28:1371–1380. doi: 10.1093/molbev/msq320. - DOI - PubMed
    1. Hodgkinson A, Eyre-Walker A. Variation in the mutation rate across mammalian genomes. Nat. Rev. Genet. 2011;12:756–766. doi: 10.1038/nrg3098. - DOI - PubMed

Publication types