Mol Biol Evol. 2022 Jan 7;39(1):msab321.
doi: 10.1093/molbev/msab321.

Modeling Sequence-Space Exploration and Emergence of Epistatic Signals in Protein Evolution

Matteo Bisardi et al. Mol Biol Evol. 2022.

Abstract

During their evolution, proteins explore sequence space via an interplay between random mutations and phenotypic selection. Here, we build upon recent progress in reconstructing data-driven fitness landscapes for families of homologous proteins to propose stochastic models of experimental protein evolution. These models quantitatively predict important features of experimentally evolved sequence libraries, such as fitness distributions and position-specific mutational spectra. They also allow us to efficiently simulate sequence libraries for a vast array of combinations of experimental parameters such as sequence divergence, selection strength, and library size. We showcase the potential of the approach by reanalyzing two recent experiments that aimed to determine protein structure from signals of epistasis emerging in experimental sequence libraries. To be detectable, these signals require sufficiently large and sufficiently diverged libraries. Our modeling framework offers a quantitative explanation for the different outcomes of recently published experiments. Furthermore, we can forecast the outcome of time- and resource-intensive evolution experiments, thereby opening a way to computationally optimize experimental protocols.

Keywords: data-driven models; epistasis; fitness landscapes; protein evolution; sequence space.

Figures

Fig. 1.
Scheme of our evolutionary modeling approach: starting from a wildtype sequence (red), we collect a large multiple sequence alignment of naturally diverged homologs (blue), which are used to learn a generative landscape model using bmDCA (Figliuzzi et al. 2018). Evolution is simulated as a Markov process in this landscape, leading to simulated, or in silico evolved mutant sequences. These sequences can be compared with the results of evolution experiments (Fantini et al. 2020; Stiffler et al. 2020) (green), to assess estimated protein fitness (so-called statistical energies, compare below), mutational profiles, and DCA-based epistasis and contact prediction. The simulation scheme also allows for changing experimental control parameters like final sequence divergence, sequencing depth, and selection strength.
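The idea of simulating evolution as a Markov process in an energy landscape can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the Metropolis acceptance rule, the random toy fields `h` and couplings `J` (a real model would be learned from the alignment), the tiny alphabet size `Q` and length `L`, and the helper names. The parameter `T` is assumed to act as a temperature-like selection strength, with small `T` meaning strong selection.

```python
import math
import random

Q = 4  # toy alphabet size (assumption; real proteins use 20 amino acids plus a gap)
L = 6  # toy sequence length (assumption)

def make_toy_model(seed=0):
    """Random fields h and couplings J standing in for a learned Potts model."""
    rng = random.Random(seed)
    h = [[rng.gauss(0, 1) for _ in range(Q)] for _ in range(L)]
    J = {}
    for i in range(L):
        for j in range(i + 1, L):
            J[(i, j)] = [[rng.gauss(0, 0.3) for _ in range(Q)] for _ in range(Q)]
    return h, J

def energy(seq, h, J):
    """Statistical energy E(s) = -sum_i h_i(s_i) - sum_{i<j} J_ij(s_i, s_j)."""
    e = -sum(h[i][a] for i, a in enumerate(seq))
    for (i, j), Jij in J.items():
        e -= Jij[seq[i]][seq[j]]
    return e

def evolve(wildtype, h, J, T, steps, rng):
    """Metropolis dynamics: propose single-site substitutions, accept by energy change."""
    seq = list(wildtype)
    e = energy(seq, h, J)
    for _ in range(steps):
        i = rng.randrange(len(seq))
        a = rng.randrange(Q)
        if a == seq[i]:
            continue  # not a substitution
        proposal = seq.copy()
        proposal[i] = a
        e_new = energy(proposal, h, J)
        # Downhill moves are always accepted; uphill moves with probability exp(-dE/T).
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / T):
            seq, e = proposal, e_new
    return seq, e
```

Running `evolve` many times from the same wildtype with independent random numbers gives a simulated library; varying `T` and `steps` then plays the role of changing selection strength and divergence.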
Fig. 2.
Experimental and predicted mutational effects in TEM-1: panel (A) shows the results of the deep-mutational scanning experiment of (Firnberg et al. 2014), as compared with the computational predictions using the epistatic Potts model (B) and the nonepistatic profile model (C). Panels (A) and (B) have a Spearman rank correlation of −0.77, showing that low energies correspond to high fitness. Panels (A) and (C) have a reduced Spearman correlation of −0.6 due to the absence of epistatic couplings in the profile model.
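The sign of the reported correlations follows from the convention that low statistical energy corresponds to high fitness. A minimal, dependency-free sketch of a Spearman rank correlation makes this concrete; the function names and the toy numbers in the usage note are illustrative assumptions, not data from the study.

```python
import math

def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    r = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            r[order[k]] = avg
        pos = end + 1
    return r

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

For example, toy fitness values `[1, 2, 3, 4]` paired with toy energies `[10, 8, 5, 1]` are perfectly anti-ordered, so `spearman` returns a value of -1 up to rounding.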
Fig. 3.
Statistical energy as a function of sequence distance from wildtype: panel (A) shows the statistical energies of the sequences from generation 20 in Stiffler et al., as a function of the Hamming distance (number of substituted amino acids) from the wildtype PSE-1. Panel (B) shows the same quantities for the in silico simulated sequences, where the selection strength T and the number of simulated evolutionary steps are adjusted to reproduce the average distance and the slope from panel (A). Panel (C) shows an example of strong selection (T ≪ 1) leading to optimized sequences having lower statistical energies/higher fitness. Panel (D) shows the case of very weak selection (T ≫ 1) resulting in random, mostly deleterious substitutions that strongly increase statistical energy.
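The Hamming distance used on the horizontal axis is simply the number of aligned positions at which two sequences differ. A minimal sketch, with hypothetical function names and made-up toy sequences in the usage note:

```python
def hamming(a, b):
    """Number of positions at which two aligned, equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))

def mean_distance(library, wildtype):
    """Average Hamming distance of a sequence library from the wildtype."""
    return sum(hamming(seq, wildtype) for seq in library) / len(library)
```

For instance, `hamming("MSIQ", "MSLQ")` is 1, and a toy library `["MSLQ", "MALK"]` has a mean distance of 2.0 from `"MSIQ"`.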
Fig. 4.
Position-specific amino acid frequencies for experimental and simulated sequence libraries: panel (A) shows the frequencies f_i(a) of usage of amino acid a in site i in round 20 of experimental PSE-1 evolution; panel (B) shows the same quantity for simulated evolution. The Spearman rank correlation between the two frequency spectra is 86%.
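The position-specific frequencies can be read as column-wise counts over an aligned library, divided by the number of sequences. A minimal sketch, assuming the library is given as equal-length strings; the function name and the toy alignment in the usage note are illustrative only.

```python
from collections import Counter

def site_frequencies(msa):
    """Return one dict per alignment column mapping amino acid a to its frequency at site i."""
    n = len(msa)
    length = len(msa[0])
    freqs = []
    for i in range(length):
        counts = Counter(seq[i] for seq in msa)
        freqs.append({a: c / n for a, c in counts.items()})
    return freqs
```

With the toy alignment `["ACD", "ACE", "GCD", "ACD"]`, the first column gives a frequency of 0.75 for `A` and 0.25 for `G`, and the fully conserved second column gives 1.0 for `C`.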
Fig. 5.
Accuracy of contact prediction as a function of sequence number and sequence divergence: panel (A) shows the accuracy of contact prediction as a function of the average sequence divergence from wildtype PSE-1 and the depth of the sequenced library. The accuracy is measured via the PPV, that is, the fraction of true positive contact predictions in the first 100 DCA-predicted contacts, compare with Materials and Methods for details. The selection strength T = 1.4 corresponds to the experimental condition in (Stiffler et al. 2020). The highlighted square indicates an average Hamming distance of about 27 and a sequence library of 165,000, as realized in (Stiffler et al. 2020). Panel (B) shows the same quantities for wildtype TEM-1, and for the experimental conditions used in (Fantini et al. 2020).
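The PPV described here is the fraction of true contacts among the top-ranked predictions. A minimal sketch under stated assumptions: predictions arrive as `((i, j), score)` pairs with higher scores ranked first, and true contacts are a set of index pairs with `i < j`. Any further filtering the authors apply (their Materials and Methods are not reproduced here) is not modeled, and the function name is hypothetical.

```python
def ppv(scored_pairs, true_contacts, top_n=100):
    """Fraction of true contacts among the top_n highest-scoring predicted pairs."""
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)[:top_n]
    if not ranked:
        return 0.0
    hits = sum(1 for pair, _ in ranked if tuple(sorted(pair)) in true_contacts)
    return hits / len(ranked)
```

With toy predictions `[((0, 5), 0.9), ((1, 7), 0.8), ((2, 3), 0.1), ((4, 9), 0.7)]` and true contacts `{(0, 5), (4, 9)}`, the top two predictions contain one true contact, so `ppv(..., top_n=2)` is 0.5.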
Fig. 6.
Dependence of the contact-prediction accuracy on selection strength: we show the PPV (100 predicted contacts) of simulated MSAs at variable selection strength T (panel A for PSE-1, panel B for TEM-1), and for different sequence distances from the wildtype protein. We predict that, for the distances observed in the evolution experiments (27 for PSE-1, 18 for TEM-1), both experiments would have benefited from slightly lower antibiotic concentrations.

