Nat Commun. 2022 Jul 12;13(1):4030.
doi: 10.1038/s41467-022-31643-3.

Deciphering polymorphism in 61,157 Escherichia coli genomes via epistatic sequence landscapes

Lucile Vigué et al. Nat Commun. 2022.

Abstract

Characterizing the effect of mutations is key to understanding the evolution of protein sequences and to separating neutral amino-acid changes from deleterious ones. Epistatic interactions between residues can make mutation effects depend on context. Context dependence constrains the amino-acid changes that can contribute to polymorphism in the short term, and the ones that can accumulate between species in the long term. We use computational approaches to accurately predict the polymorphisms segregating in a panel of 61,157 Escherichia coli genomes from the analysis of distant homologues. By comparing a context-aware Direct-Coupling Analysis model to a non-epistatic approach, we show that the genetic context strongly constrains the tolerable amino acids in 30% to 50% of amino-acid sites. The study of more distant species suggests the gradual build-up of genetic context over long evolutionary timescales by the accumulation of small epistatic contributions.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Schematic representation of the sequence landscape and its relation to sequence data.
The landscape is defined via a real-valued function of any aligned sequence, with low values indicating “good” functional sequences (green area), and high values “bad” non-functional sequences (red area). Natural sequences can be seen as samples of low values: close orthologs (light blue) of a reference sequence (in white) form a sample which is localized in sequence space and surrounded by closely diverged species (mid-blue). Distantly diverged homologs (dark blue) form a global sample. All sequence data are aligned relative to the reference sequence. Within our work, the global sample will be used to infer data-driven landscape models for all protein families present in the E. coli core genome, and the variability of the local sample and the closely diverged species will be analyzed for signatures of selection, epistasis and context dependence of natural amino-acid polymorphisms.
Fig. 2. Predicted effects of observed amino acids using an IND model (neglecting epistasis) or a DCA model (incorporating pairwise epistasis).
a Rank of native amino acid in the reference strain as compared to all 20 possible amino acids. DCA model (red) outperforms IND (yellow) by predicting twice as many native amino acids to be the best possible. b DCA rank of major and minor allele for all sites that are polymorphic at a >5% threshold, among all 20 possible amino acids. Major alleles (alleles at frequencies >50%, in red) have better ranks than minor alleles (alleles at frequencies between 5 and 50%, in pink). The distribution of consensus alleles peaks at the first rank (46.2% of polymorphic sites have major allele ranking first and 17.6% have second-best rank) while the distribution of minor alleles peaks at the second rank (13.3% have the best rank against 17.6% that are second-best). c IND rank of major and minor allele for all sites that are polymorphic at a >5% threshold, among all 20 possible amino acids. As with DCA, major alleles (in orange) have better ranks than minor alleles (in yellow) and the distribution of consensus alleles peaks at the first rank. However, the distribution is spread towards greater ranks (only 24.1% of polymorphic sites have major allele ranking first and 15.5% have second-best rank, similarly minor alleles rank first in 9.6% and second-best in 13.3% of polymorphic sites) compared to DCA ranking. d Distribution of DCA scores of non-synonymous polymorphisms observed at frequencies >5% across the >60,000 strains (blue) compared to mutations sampled from an IND model (yellow) or to random mutations (gray). A large number of possible mutations are predicted to be highly deleterious (positive scores) compared to naturally occurring polymorphisms that tend to be neutral (blue distribution centered on zero). Polymorphisms predicted from IND are slightly deleterious once epistasis is taken into account (yellow distribution shifted towards positive values). 
Boxplot center lines represent medians, box limits are upper and lower quartiles, whiskers extend to show the rest of the distribution within a 1.5 × interquartile range, outliers are represented with points; sample size is 3477 mutations for each of the three groups.
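The ranking used in panels a to c can be illustrated with a minimal sketch. It assumes that each of the 20 amino acids at a site has a model score and that a lower score is better, in line with the landscape convention of Fig. 1. The function name, the example scores and the handling of ties are hypothetical and are not taken from the paper.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def rank_of_amino_acid(scores, amino_acid):
    """Return the 1-based rank of amino_acid at a site.

    scores maps each amino acid to a model score; rank 1 is the lowest
    (best) score. Tied amino acids share the better rank (an assumption).
    """
    target = scores[amino_acid]
    return 1 + sum(1 for s in scores.values() if s < target)


# Hypothetical scores for a single site: 'A' scores best, 'Y' worst.
example_scores = {aa: float(i) for i, aa in enumerate(AMINO_ACIDS)}

native_rank = rank_of_amino_acid(example_scores, "A")
worst_rank = rank_of_amino_acid(example_scores, "Y")
```

Applied to every site with scores from an IND-like or a DCA-like model, the resulting ranks could be tabulated into histograms of the kind described in the caption.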
Fig. 3. Predicting the variability of amino-acid sites.
a Entropy quantifies the level of variability of an amino-acid site from conserved (entropy ~ 0) to highly variable (entropy ~ 4). It can be computed from a non-epistatic model (Context-Independent Entropy (CIE), yellow), i.e., from the frequencies of amino acids observed across distant species, or from an epistatic model (Context-Dependent Entropy (CDE), red), i.e., from the conditional probabilities of observing each amino acid in the E. coli background. Residues that have strong epistatic interactions with others will show little polymorphism once the genetic context is fixed (low CDE) but can vary between species (high CIE) by co-evolving with their partners (cf. hatched residues). b Bivariate histogram of CDE and CIE for all sites in the dataset. Two populations of sites are clearly recognizable, separated in particular by their CDE values. c Marginal distributions of CDE (red) and CIE (yellow) for all sites in the dataset. CDE divides amino-acid sites into two populations of similar sizes: conserved (CDE < 1) and variable (CDE ≥ 1). In contrast, most of the amino-acid sites have a high CIE, i.e., IND predicts them to be highly variable.
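A site entropy of this kind can be sketched as follows, assuming it is the Shannon entropy in base 2 of a distribution over the 20 amino acids. That assumption is consistent with the stated range from about 0 to about 4, since log2(20) is roughly 4.32, but the caption does not give the exact formula. The same function would yield a CIE when fed amino-acid frequencies across distant species and a CDE when fed conditional probabilities in a fixed background. The function name and example distributions are hypothetical.

```python
import math


def shannon_entropy_bits(probs):
    """Shannon entropy, in bits, of a distribution over amino acids.

    probs is a sequence of non-negative weights; it is normalized here,
    so raw counts are accepted as well (an illustrative convenience).
    """
    total = sum(probs)
    if total <= 0:
        raise ValueError("probs must contain at least one positive weight")
    entropy = 0.0
    for p in probs:
        q = p / total
        if q > 0:  # the term 0 * log(0) is taken to be 0
            entropy -= q * math.log2(q)
    return entropy


# Hypothetical sites: one fully conserved, one uniformly variable.
conserved_site = [1.0] + [0.0] * 19
uniform_site = [1.0 / 20] * 20

entropy_conserved = shannon_entropy_bits(conserved_site)  # no variability
entropy_uniform = shannon_entropy_bits(uniform_site)      # maximal variability
```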
Fig. 4. Predicting amino-acid sites that are conserved or polymorphic in E. coli. Comparison of the performance of IND and DCA models.
a Bivariate histogram of CDE and CIE for sites that are conserved across >60,000 strains of E. coli. Most of them cluster on the left peak of low CDE. b Bivariate histogram of CDE and CIE for sites that are polymorphic at a 5% threshold across >60,000 strains of E. coli. Most of them cluster on the right peak of high CDE. c Distribution of CIE for conserved (green) and polymorphic (blue) sites in E. coli. A non-epistatic model fails to distinguish between the two populations. Most of the sites are predicted to have a high entropy, and therefore to be highly variable, including those that display no mutation in >60,000 strains of E. coli (green distribution). d Distribution of CDE for conserved (green) and polymorphic (blue) sites in E. coli. A model that incorporates pairwise epistasis predicts a low entropy for conserved sites (the green distribution peaks near 0) and a high entropy for variable sites (the blue distribution peaks near 4).
Fig. 5. Quantifying the effect of the context in reducing amino-acid site variability.
a The genetic background is expected to differentially impact amino-acid sites. It has a low influence on sites that have the same level of variability in E. coli and across distant species (blue and light green). In contrast, it strongly impacts sites that are variable across distant species but are conserved in E. coli due to local epistatic couplings (dark green). b Information gain quantifies the difference between the variability of an amino-acid site across distant species and its potential variability in E. coli. Sites that are variable across distant species (CIE ≥ 1) but conserved in E. coli (CDE < 1) are the ones with the highest information gains (dark green distribution). Note that the information gain is given in bits: 1 bit corresponds to an effective reduction of the available amino acids by a factor of 2, 2 bits by a factor of 4, and 3 bits by a factor of 8.
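The correspondence between bits and reduction factors can be checked with a short sketch. It assumes that the information gain is the plain difference CIE minus CDE, which is one reading of "difference" in the caption and not a formula confirmed by this page. The function names and the example entropies are hypothetical.

```python
def information_gain_bits(cie, cde):
    """Information gain of a site, assumed here to be CIE - CDE (in bits)."""
    return cie - cde


def effective_reduction_factor(gain_bits):
    """Factor by which the effective number of amino acids shrinks."""
    return 2.0 ** gain_bits


# Hypothetical site: variable across species, conserved in one background.
gain = information_gain_bits(cie=3.5, cde=0.5)
factor = effective_reduction_factor(gain)
```

With these example values the gain is 3 bits, which matches the factor of 8 quoted in the caption.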
Fig. 6. Epistasis observed in E. coli.
a Mutational effect ΔEij of observed double mutations with respect to the reference, plotted against the sum ΔEi + ΔEj of the individual mutation scores. The absence of clear deviations from the diagonal reveals the lack of strong epistatic couplings between pairs of mutations in our strain dataset. b Histogram of the effective proportion of sites coupled with a given amino acid. It is computed from the inverse participation ratio: 1/(IPR × protein length). The median of the distribution is 24%, meaning that amino-acid sites are generally coupled to about one-fourth of the other residues in the protein according to DCA modeling of epistasis.
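The quantity 1/(IPR × protein length) in panel b can be sketched under explicit assumptions. Here the IPR is taken to be the sum of squared weights after the coupling strengths of one site have been normalized to sum to 1, which is a common definition of the inverse participation ratio, and the protein length is taken to be the number of entries in the list. The page does not specify how the paper defines the weights, so the function name and inputs are hypothetical.

```python
def effective_coupled_fraction(coupling_strengths):
    """Effective proportion of sites coupled to one site: 1 / (IPR * length).

    coupling_strengths holds one non-negative value per site of the protein
    (assumption). Weights are normalized to sum to 1 and IPR = sum(w ** 2).
    """
    total = sum(coupling_strengths)
    if total <= 0:
        raise ValueError("at least one coupling strength must be positive")
    weights = [s / total for s in coupling_strengths]
    ipr = sum(w * w for w in weights)
    return 1.0 / (ipr * len(coupling_strengths))


# Hypothetical extremes for a protein of 4 sites.
evenly_coupled = effective_coupled_fraction([1.0, 1.0, 1.0, 1.0])
single_partner = effective_coupled_fraction([1.0, 0.0, 0.0, 0.0])
```

Under this definition a site coupled equally to all positions gives a fraction of 1, while a site coupled to a single partner gives 1 divided by the protein length.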
Fig. 7. Epistasis between fixed differences in a panel of diverged species.
a Phylogenetic tree of studied strains. Tree built from an amino-acid sequence alignment of 878 core genes. b DCA epistatic cost decreases with divergence. It is defined as the difference between the total change in statistical energy between pairs of sequences and the sum of single-mutant effects. Negative values correspond to positive epistasis: mutations are more beneficial (lower DCA score) taken altogether than the sum of their individual effects. Boxplot center lines represent medians, box limits are upper and lower quartiles, whiskers extend to show the rest of the distribution within a 1.5 × interquartile range, outliers are represented with points. Sample sizes are n = 22,352 for <5%, n = 15,870 for 5−10%, n = 10,810 for 10−15%, n = 6776 for 15−20%, n = 3564 for 20−25%, n = 3432 for >25%. c Distribution of epistatic couplings between pairs of fixed differences between E. coli and Y. pestis. The distribution is shifted towards negative values corresponding to positive epistatic couplings between fixed differences: they are better together than the sum of their individual effects. The relatively small values of these couplings as compared to overall epistatic scores measured between entire sequences (b) indicate that epistatic patterns build up gradually by an accumulation of many small couplings.
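The epistatic cost defined in panel b can be sketched as follows. The sketch assumes that single-mutant effects are measured in the background of the first (reference) sequence, which the caption does not state, and it uses a toy energy function in place of a trained DCA model. All names and values are hypothetical.

```python
def epistatic_cost(energy, reference, target):
    """Total energy change minus the sum of single-mutant effects.

    energy maps an aligned sequence (string) to a statistical energy.
    Single mutants are built in the reference background (assumption).
    Negative values indicate positive epistasis.
    """
    if len(reference) != len(target):
        raise ValueError("sequences must be aligned to the same length")
    base = energy(reference)
    total_change = energy(target) - base
    single_sum = 0.0
    for i, (ref_aa, tgt_aa) in enumerate(zip(reference, target)):
        if ref_aa != tgt_aa:
            single_mutant = reference[:i] + tgt_aa + reference[i + 1:]
            single_sum += energy(single_mutant) - base
    return total_change - single_sum


def toy_energy(seq):
    """Toy landscape: each non-'A' residue costs 1.0, but 'C' at both
    positions 0 and 1 is rewarded by 0.5 (a single pairwise coupling)."""
    cost = float(sum(1 for aa in seq if aa != "A"))
    if seq[0] == "C" and seq[1] == "C":
        cost -= 0.5
    return cost


coupled_cost = epistatic_cost(toy_energy, "AAA", "CCA")   # coupled pair
additive_cost = epistatic_cost(toy_energy, "AAA", "CAC")  # uncoupled pair
```

In the toy landscape the coupled pair gives a cost of -0.5, because the two changes are better together than the sum of their individual effects, while the uncoupled pair gives 0.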
Fig. 8. Epistatic couplings between amino-acid differences that have fixed between E. coli and Y. pestis in rplK gene.
a Distribution of epistatic couplings between pairs of fixed differences. The left tail of negative DCA scores signals an over-representation of positive epistatic couplings. Boxplot center line represents the median, box limits are upper and lower quartiles, whiskers extend to show the rest of the distribution within a 1.5 × interquartile range, outliers are represented with points, sample size is n = 91 couplings. b Joint distribution of epistatic coupling values between pairs of residues harboring a fixed difference and their physical distance in the 3D structure of the protein. The strongest couplings correspond to residues that are in contact (<10 Å). However, most of the couplings involve residues that are more distant than 10 Å. c Representation of the 3D structure of the protein encoded by rplK: residues that differ between E. coli and Y. pestis are highlighted with red spheres. Most of the fixed differences cluster together in the same domain, explaining why we observe a strong epistatic signal even though most of the pairs of fixed differences are not in physical contact.
