On the sparsity of fitness functions and implications for learning

David H Brookes¹, Amirali Aghazadeh², Jennifer Listgarten³,⁴

Affiliations

¹ Biophysics Graduate Group, University of California, Berkeley, CA 94720.
² Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720.
³ Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720; jennl@berkeley.edu.
⁴ Center for Computational Biology, University of California, Berkeley, CA 94720.

PMID: 34937698
PMCID: PMC8740588
DOI: 10.1073/pnas.2109649118

David H Brookes et al. Proc Natl Acad Sci U S A. 2022 Jan 4;119(1):e2109649118. doi: 10.1073/pnas.2109649118.

Abstract

Fitness functions map biological sequences to a scalar property of interest. Accurate estimation of these functions yields biological insight and sets the foundation for model-based sequence design. However, the fitness datasets available to learn these functions are typically small relative to the large combinatorial space of sequences; characterizing how much data are needed for accurate estimation remains an open problem. There is a growing body of evidence demonstrating that empirical fitness functions display substantial sparsity when represented in terms of epistatic interactions. Moreover, the theory of Compressed Sensing provides scaling laws for the number of samples required to exactly recover a sparse function. Motivated by these results, we develop a framework to study the sparsity of fitness functions sampled from a generalization of the NK model, a widely used random field model of fitness functions. In particular, we present results that allow us to test the effect of the Generalized NK (GNK) model's interpretable parameters (sequence length, alphabet size, and assumed interactions between sequence positions) on the sparsity of fitness functions sampled from the model and, consequently, the number of measurements required to exactly recover these functions. We validate our framework by demonstrating that GNK models with parameters set according to structural considerations can be used to accurately approximate the number of samples required to recover two empirical protein fitness functions and an RNA fitness function. In addition, we show that these GNK models identify important higher-order epistatic interactions in the empirical fitness functions using only structural information.

Keywords: compressed sensing; epistasis; fitness functions; protein structure.

Conflict of interest statement

Competing interest statement: J.L. is on the Scientific Advisory Board for Foresite Labs and Patch Biosciences.

Figures

**Fig. 1.**
Graphical depictions of GNK neighborhood schemes for L = 9 and K = 3. In each grid, rows represent neighborhoods, and columns represent sequence positions. A filled square in the $(i, j)^{\text{th}}$ position in the grid denotes that sequence position j is in the neighborhood $V^{[i]}$. (A) Random neighborhoods. (B) Adjacent neighborhoods. (C) Block neighborhoods.
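The three neighborhood schemes in the caption can be sketched in a few lines of Python, along with a count of the Fourier coefficients they permit. This is a minimal illustration, not the authors' code. It assumes that each neighborhood $V^{[j]}$ contains position j, that Adjacent neighborhoods are windows of K consecutive positions that wrap around the end of the sequence, and that Block neighborhoods require K to divide L. It also assumes that an epistatic interaction among a set of positions can be nonzero only when that set lies inside at least one neighborhood, with $(q-1)^r$ coefficients per set of r positions. All function names are hypothetical.

```python
import random
from itertools import combinations


def random_neighborhoods(L, K, seed=0):
    """Position j plus K-1 other positions drawn uniformly at random (assumed)."""
    rng = random.Random(seed)
    result = []
    for j in range(L):
        others = rng.sample([p for p in range(L) if p != j], K - 1)
        result.append(frozenset([j, *others]))
    return result


def adjacent_neighborhoods(L, K):
    """Window of K consecutive positions starting at j, wrapping around (assumed)."""
    return [frozenset((j + k) % L for k in range(K)) for j in range(L)]


def block_neighborhoods(L, K):
    """Positions split into L/K disjoint blocks; each neighborhood is its block."""
    if L % K != 0:
        raise ValueError("this sketch requires K to divide L")
    return [frozenset(range((j // K) * K, (j // K) * K + K)) for j in range(L)]


def gnk_sparsity(neighborhoods, q):
    """Count Fourier coefficients allowed to be nonzero under the stated assumption.

    Every subset U of some neighborhood contributes (q - 1) ** len(U) coefficients;
    subsets shared by several neighborhoods are counted once.
    """
    subsets = set()
    for V in neighborhoods:
        for r in range(len(V) + 1):
            subsets.update(frozenset(c) for c in combinations(sorted(V), r))
    return sum((q - 1) ** len(U) for U in subsets)


if __name__ == "__main__":
    L, K, q = 9, 3, 2
    for name, scheme in [
        ("Random", random_neighborhoods(L, K)),
        ("Adjacent", adjacent_neighborhoods(L, K)),
        ("Block", block_neighborhoods(L, K)),
    ]:
        print(name, gnk_sparsity(scheme, q))
```

Under these assumptions, Block neighborhoods give the fewest coefficients because all positions in a block share one neighborhood, while neighborhoods that overlap only partially contribute more distinct subsets.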

**Fig. 2.**
The sparsity of GNK fitness functions. (A) Upper bound on sparsity of GNK fitness functions with constant neighborhood sizes for q = 2 and a range of settings of the L and K parameters. (B) Upper bound for L = 20 and a range of settings of the alphabet size q and the K parameter (colors as in A). Alphabet sizes corresponding to binary (q = 2), nucleotide (q = 4), and amino acid (q = 20) alphabets are highlighted with open circles. (C) Sparsity of GNK fitness functions with neighborhoods constructed with each of the standard neighborhood schemes for L = 20, q = 2, and a range of settings of K, denoted by markers. (D) Fraction of sampled GNK fitness functions with Random neighborhoods recovered at a range of settings of C. Each gray curve represents sampled fitness functions at a particular value of $L \in \{5, 6, \dots, 13\}$, $q \in \{2, 3, 4\}$, and $K \in \{1, 2, 3, 4, 5\}$. The red curve averages over all 907 sampled functions. The value C = 2.62, which we chose to use for subsequent numerical experiments, is highlighted with a dashed line.

**Fig. 3.**
Minimum number of measurements required to exactly recover GNK fitness functions with constant neighborhood sizes. (A) Upper bound on the minimum N required to recover GNK fitness functions with constant neighborhood sizes for q = 2 and a range of settings of the L and K parameters. (B) Upper bound for L = 20 and a range of settings of the alphabet size q and the K parameter (colors as in A). Alphabet sizes corresponding to binary (q = 2), nucleotide (q = 4), and amino acid (q = 20) alphabets are highlighted with open circles.

**Fig. 4.**
Comparison of empirical fitness functions to GNK models with Structural neighborhoods. (*Top*) Comparison to mTagBFP2 fitness function from ref. . (*Middle*) Comparison to His3p fitness function from ref. (46). (*Bottom*) Comparison to quasi-empirical fitness function of the Hammerhead ribozyme HH9. (A) Structural neighborhoods derived from crystal structure of TagBFP (*Top*), I-TASSER predicted structure of His3p (*Middle*), and predicted secondary structures of the Hammerhead Ribozyme HH9 (*Bottom*). (B) Magnitude of empirical Fourier coefficients (upper plot, in blue) compared to expected magnitudes of coefficients in the GNK model (reverse plot, in red). Dashed lines separate orders of epistatic interactions, with each group of $r^{\text{th}}$-order interactions indicated. (C) Percent variance explained by the largest Fourier coefficients in the empirical fitness functions and in fitness functions sampled from the GNK model. The dotted line indicates the exact sparsity of the GNK fitness functions, which is 56 in *Top*, 76 in *Middle*, and 1,033 in *Bottom*, at which points 97.1%, 90.4%, and 97.5% of the empirical variances are explained, respectively. Std. dev., SD. (D) Error of LASSO estimates of empirical fitness functions at a range of training set sizes. Each point on the horizontal axis represents the number of training samples, N, that were used to fit the LASSO estimate of the fitness function. Each point on the blue curve represents the R² between the estimated and empirical fitness functions, averaged over 50 randomly sampled training sets of size N. The point at the number of samples required to exactly recover the GNK model with Structural neighborhoods (N = 575 in *Top*, N = 660 in *Middle*, and N = 13,036 in *Bottom*) is highlighted with a red dot and dashed lines; at this number of samples, the mean prediction R² is 0.948 in *Top*, 0.870 in *Middle*, and 0.969 in *Bottom*. Error bars indicate the SD of R² over training replicates. D, *Insets* show paired plots between the estimated and predicted fitness function for one example training set of size N = 575 (*Top*), N = 660 (*Middle*), and N = 13,036 (*Bottom*).
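The sample sizes quoted in this caption can be related to the quoted sparsities with a short calculation. The sketch below assumes the compressed sensing scaling takes the form $N = \lceil C \cdot S \cdot \log_{10}(q^L) \rceil$ with C = 2.62 from Fig. 2D. Neither the base-10 logarithm nor the (L, q) values for the three systems are stated in this excerpt; they are assumptions, chosen because they are consistent with the S and N values given above. The function name is hypothetical.

```python
import math

C = 2.62  # constant highlighted in Fig. 2D


def samples_to_recover(S, L, q, C=C):
    """Assumed scaling law: N = ceil(C * S * log10(q ** L))."""
    return math.ceil(C * S * L * math.log10(q))


# (sparsity S from the caption, assumed sequence length L, assumed alphabet size q)
cases = {
    "mTagBFP2 (Top)": (56, 13, 2),
    "His3p (Middle)": (76, 11, 2),
    "Hammerhead HH9 (Bottom)": (1033, 8, 4),
}

if __name__ == "__main__":
    for name, (S, L, q) in cases.items():
        print(f"{name}: S = {S}, N = {samples_to_recover(S, L, q)}")
```

Under these assumptions, N grows linearly in the sparsity S and in the sequence length L but only logarithmically in the size $q^L$ of the sequence space, which is why sparse fitness functions can be recovered from far fewer measurements than there are sequences.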

