Proc Natl Acad Sci U S A. 2022 Jan 4;119(1):e2109649118.
doi: 10.1073/pnas.2109649118.

On the sparsity of fitness functions and implications for learning

David H Brookes et al. Proc Natl Acad Sci U S A.

Abstract

Fitness functions map biological sequences to a scalar property of interest. Accurate estimation of these functions yields biological insight and sets the foundation for model-based sequence design. However, the fitness datasets available to learn these functions are typically small relative to the large combinatorial space of sequences; characterizing how much data are needed for accurate estimation remains an open problem. There is a growing body of evidence demonstrating that empirical fitness functions display substantial sparsity when represented in terms of epistatic interactions. Moreover, the theory of Compressed Sensing provides scaling laws for the number of samples required to exactly recover a sparse function. Motivated by these results, we develop a framework to study the sparsity of fitness functions sampled from a generalization of the NK model, a widely used random field model of fitness functions. In particular, we present results that allow us to test the effect of the Generalized NK (GNK) model's interpretable parameters (sequence length, alphabet size, and assumed interactions between sequence positions) on the sparsity of fitness functions sampled from the model and, consequently, the number of measurements required to exactly recover these functions. We validate our framework by demonstrating that GNK models with parameters set according to structural considerations can be used to accurately approximate the number of samples required to recover two empirical protein fitness functions and an RNA fitness function. In addition, we show that these GNK models identify important higher-order epistatic interactions in the empirical fitness functions using only structural information.
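To make "sparsity when represented in terms of epistatic interactions" concrete, the following is a minimal sketch for the binary case (q = 2), assuming the Walsh-Hadamard basis as the Fourier basis. The function names and the toy fitness function are hypothetical illustrations, not taken from the paper.

```python
from itertools import product


def walsh_hadamard_coefficients(fitness, L):
    """Return {tuple_of_positions: coefficient} for a fitness function on {0,1}^L.

    The coefficient indexed by a set U of positions measures the epistatic
    interaction among exactly those positions (the empty set is the mean).
    """
    seqs = list(product((0, 1), repeat=L))
    coeffs = {}
    for mask in product((0, 1), repeat=L):
        U = tuple(i for i, bit in enumerate(mask) if bit)
        # Basis function for U: (-1)^(number of mutated positions inside U).
        total = sum(fitness(s) * (-1) ** sum(s[i] for i in U) for s in seqs)
        coeffs[U] = total / len(seqs)
    return coeffs


def toy_fitness(s):
    # Hypothetical fitness: two additive effects plus one pairwise interaction.
    return 1.0 + 0.5 * s[0] - 0.25 * s[2] + 2.0 * s[0] * s[1]


coeffs = walsh_hadamard_coefficients(toy_fitness, L=4)
nonzero = {U: c for U, c in coeffs.items() if abs(c) > 1e-12}
print(len(coeffs), len(nonzero))  # prints: 16 5
```

Only 5 of the 16 coefficients are nonzero here: the mean, three first-order terms, and one pairwise term. This is the sense in which a fitness function with few interactions is sparse in the epistatic basis.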

Keywords: compressed sensing; epistasis; fitness functions; protein structure.

Conflict of interest statement

Competing interest statement: J.L. is on the Scientific Advisory Board for Foresite Labs and Patch Biosciences.

Figures

Fig. 1.
Graphical depictions of GNK neighborhood schemes for L = 9 and K = 3. In each grid, rows represent neighborhoods, and columns represent sequence positions. A filled square in the (i,j)th position in the grid denotes that sequence position j is in the neighborhood V[i]. (A) Random neighborhoods. (B) Adjacent neighborhoods. (C) Block neighborhoods.
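The three schemes in the caption can be sketched as follows. This is an illustrative construction under assumed conventions, not the authors' code: K is taken to be the neighborhood size including the position itself, Adjacent neighborhoods extend to the right and wrap around the sequence end, and Block neighborhoods require K to divide L. All function names are hypothetical.

```python
import random


def random_neighborhoods(L, K, rng):
    # Position j plus K-1 other positions drawn uniformly without replacement.
    result = []
    for j in range(L):
        others = rng.sample([p for p in range(L) if p != j], K - 1)
        result.append(sorted([j] + others))
    return result


def adjacent_neighborhoods(L, K):
    # Position j plus the next K-1 positions, wrapping around (assumed convention).
    return [sorted({(j + d) % L for d in range(K)}) for j in range(L)]


def block_neighborhoods(L, K):
    # Positions are split into consecutive blocks of size K; each position's
    # neighborhood is its whole block.
    if L % K != 0:
        raise ValueError("Block neighborhoods assume K divides L")
    return [list(range((j // K) * K, (j // K) * K + K)) for j in range(L)]


def render(neighborhoods, L):
    # One row per neighborhood, one column per position; '#' marks membership.
    return "\n".join(
        "".join("#" if p in V else "." for p in range(L)) for V in neighborhoods
    )


L, K = 9, 3
print(render(block_neighborhoods(L, K), L))
print()
print(render(adjacent_neighborhoods(L, K), L))
print()
print(render(random_neighborhoods(L, K, random.Random(0)), L))
```

Each printed grid follows the caption's layout: row i is neighborhood V[i], and a filled cell in column j means position j belongs to it.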
Fig. 2.
The sparsity of GNK fitness functions. (A) Upper bound on sparsity of GNK fitness functions with constant neighborhood sizes for q = 2 and a range of settings of the L and K parameters. (B) Upper bound for L = 20 and a range of settings of the alphabet size q and the K parameter (colors as in A). Alphabet sizes corresponding to binary (q = 2), nucleotide (q = 4), and amino acid (q = 20) alphabets are highlighted with open circles. (C) Sparsity of GNK fitness functions with neighborhoods constructed with each of the standard neighborhood schemes for L = 20, q = 2, and a range of settings of K, denoted by markers. (D) Fraction of sampled GNK fitness functions with Random neighborhoods recovered at a range of settings of C. Each gray curve represents sampled fitness functions at a particular value of L ∈ {5, 6, …, 13}, q ∈ {2, 3, 4}, and K ∈ {1, 2, 3, 4, 5}. The red curve averages over all 907 sampled functions. The value C = 2.62, which we chose to use for subsequent numerical experiments, is highlighted with a dashed line.
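A sketch of how sparsity could be counted from a set of neighborhoods. It rests on an assumption about the full paper's result that this excerpt does not state: a Fourier coefficient is nonzero exactly when its set of positions is contained in at least one neighborhood, and each such set of size r contributes (q - 1)^r coefficients. The function name is hypothetical.

```python
from itertools import combinations


def gnk_sparsity(neighborhoods, q):
    """Count nonzero Fourier coefficients implied by the neighborhoods.

    Assumption: an interaction among a set U of positions is nonzero only if
    U is a subset of some neighborhood; it then accounts for (q-1)^|U| terms.
    """
    interactions = set()
    for V in neighborhoods:
        positions = sorted(set(V))
        for r in range(len(positions) + 1):
            # Sorted tuples, so the same subset from two neighborhoods dedups.
            interactions.update(combinations(positions, r))
    return sum((q - 1) ** len(U) for U in interactions)


# Block neighborhoods for L = 9, K = 3: three disjoint blocks.
blocks = [[0, 1, 2]] * 3 + [[3, 4, 5]] * 3 + [[6, 7, 8]] * 3
# Adjacent neighborhoods for L = 9, K = 3, wrapping around (assumed convention).
adjacent = [[j, (j + 1) % 9, (j + 2) % 9] for j in range(9)]

print(gnk_sparsity(blocks, q=2))    # prints: 22
print(gnk_sparsity(adjacent, q=2))  # prints: 37
print(gnk_sparsity(blocks, q=4))    # prints: 190
```

Under this assumption, overlapping neighborhoods (Adjacent) admit more distinct interactions than disjoint ones (Block) at the same L and K, and the count grows quickly with the alphabet size q.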
Fig. 3.
Minimum number of measurements required to exactly recover GNK fitness functions with constant neighborhood sizes. (A) Upper bound on the minimum N required to recover GNK fitness functions with constant neighborhood sizes for q = 2 and a range of settings of the L and K parameters. (B) Upper bound for L = 20 and a range of settings of the alphabet size q and the K parameter (colors as in A). Alphabet sizes corresponding to binary (q = 2), nucleotide (q = 4), and amino acid (q = 20) alphabets are highlighted with open circles.
Fig. 4.
Comparison of empirical fitness functions to GNK models with Structural neighborhoods. (Top) Comparison to mTagBFP2 fitness function from ref. . (Middle) Comparison to His3p fitness function from ref. (46). (Bottom) Comparison to quasi-empirical fitness function of the Hammerhead ribozyme HH9. (A) Structural neighborhoods derived from crystal structure of TagBFP (Top), I-TASSER predicted structure of His3p (Middle), and predicted secondary structures of the Hammerhead Ribozyme HH9 (Bottom). (B) Magnitude of empirical Fourier coefficients (upper plot, in blue) compared to expected magnitudes of coefficients in the GNK model (reverse plot, in red). Dashed lines separate orders of epistatic interactions, with each group of rth-order interactions indicated. (C) Percent variance explained by the largest Fourier coefficients in the empirical fitness functions and in fitness functions sampled from the GNK model. The dotted line indicates the exact sparsity of the GNK fitness functions, which is 56 in Top, 76 in Middle, and 1,033 in Bottom, at which points 97.1%, 90.4%, and 97.5% of the empirical variances are explained, respectively. Std. dev., SD. (D) Error of LASSO estimates of empirical fitness functions at a range of training set sizes. Each point on the horizontal axis represents the number of training samples, N, that were used to fit the LASSO estimate of the fitness function. Each point on the blue curve represents the R² between the estimated and empirical fitness functions, averaged over 50 randomly sampled training sets of size N. The point at the number of samples required to exactly recover the GNK model with Structural neighborhoods (N = 575 in Top, N = 660 in Middle, and N = 13,036 in Bottom) is highlighted with a red dot and dashed lines; at this number of samples, the mean prediction R² is 0.948 in Top, 0.870 in Middle, and 0.969 in Bottom. Error bars indicate the SD of R² over training replicates. D, Insets show paired plots between the estimated and predicted fitness function for one example training set of size N = 575 (Top), N = 660 (Middle), and N = 13,036 (Bottom).
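The sparsity values and sample counts quoted in the caption are consistent with a compressed-sensing scaling of the form N = C · S · log10(q^L), rounded up, with C = 2.62. The sketch below checks that arithmetic. The base-10 logarithm, the rounding, and the sequence lengths and alphabet sizes are assumptions, since none of them is stated in this excerpt: 13 binary positions for mTagBFP2, 11 binary positions for His3p, and 8 nucleotide positions for HH9. The function name is hypothetical.

```python
import math


def samples_to_recover(S, q, L, C=2.62):
    # Assumed scaling law: N = ceil(C * S * log10(q^L)).
    # log10(q^L) is computed as L * log10(q) to avoid forming q**L.
    return math.ceil(C * S * L * math.log10(q))


# (sparsity S, alphabet size q, sequence length L); q and L are assumptions.
cases = {
    "mTagBFP2": (56, 2, 13),
    "His3p": (76, 2, 11),
    "HH9": (1033, 4, 8),
}
for name, (S, q, L) in cases.items():
    print(name, samples_to_recover(S, q, L))
# prints:
# mTagBFP2 575
# His3p 660
# HH9 13036
```

Under these assumptions the three outputs match the N values highlighted in panel D, which suggests the sample counts follow directly from the sparsity and the size of the sequence space.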
