Massive experimental quantification allows interpretable deep learning of protein aggregation

Mike Thompson¹, Mariano Martín², Trinidad Sanmartín Olmo², Chandana Rajesh³, Peter K Koo³, Benedetta Bolognesi², Ben Lehner¹ ⁴ ⁵ ⁶

Affiliations

¹ Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona 08003, Spain.
² Institute for Bioengineering of Catalonia (IBEC), Barcelona Institute of Science and Technology, Barcelona 08028, Spain.
³ Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA.
⁴ Universitat Pompeu Fabra (UPF), Barcelona 08002, Spain.
⁵ ICREA, Pg. Lluis Companys 23, Barcelona 08010, Spain.
⁶ Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton CB10 1RQ, UK.

PMID: 40305601
PMCID: PMC12042874
DOI: 10.1126/sciadv.adt5111

Sci Adv. 2025 May 2;11(18):eadt5111. doi: 10.1126/sciadv.adt5111. Epub 2025 Apr 30.

Abstract

Protein aggregation is a pathological hallmark of more than 50 human diseases and a major problem for biotechnology. Methods have been proposed to predict aggregation from sequence, but these have been trained and evaluated on small and biased experimental datasets. Here we directly address this data shortage by experimentally quantifying the aggregation of >100,000 protein sequences. This unprecedented dataset reveals the limited performance of existing computational methods and allows us to train CANYA, a convolution-attention hybrid neural network that accurately predicts aggregation from sequence. We adapt genomic neural network interpretability analyses to reveal CANYA's decision-making process and learned grammar. Our results illustrate the power of massive experimental analysis of random sequence-spaces and provide an interpretable and robust neural network model to predict aggregation.

Figures

**Fig. 1. Quantifying the aggregation of >100,000 random peptides.**
(A) Examples of amyloids in human diseases. (B) The amyloid state is thermodynamically favorable but requires overcoming a kinetic barrier. (C) Experimental design. (D) While we explore more than 110,000 sequences, our dataset is a tiny sample of the possible sequence space. (E) The assayed aggregation scores of sequences labeled “aggregators” and “non-aggregators” in our experiment. (F) An example of a follow-up replication experiment using a synthesized library (NNK3; see fig. S2 for others; data file S3). AA, amino acid.
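
The scale gap in panel (D) can be illustrated with simple arithmetic: a library drawn from the 20 standard amino acids has 20^L possible sequences of length L. The sketch below is only an illustration. The peptide length is not stated in this record, so the value of 20 residues is an assumption, and codon bias in the synthesized library is ignored.

```python
# Fraction of sequence space covered by a library of random peptides.
# ASSUMPTION: the peptide length (20) is illustrative and not taken from this record.
from fractions import Fraction

N_AMINO_ACIDS = 20      # standard amino acid alphabet
PEPTIDE_LENGTH = 20     # hypothetical length, see lead-in
N_ASSAYED = 110_000     # "more than 110,000 sequences" per the caption


def sampled_fraction(n_assayed: int, length: int, alphabet: int = N_AMINO_ACIDS) -> Fraction:
    """Exact fraction of all possible sequences of `length` that were assayed."""
    return Fraction(n_assayed, alphabet ** length)


fraction = sampled_fraction(N_ASSAYED, PEPTIDE_LENGTH)
print(f"possible sequences: {N_AMINO_ACIDS ** PEPTIDE_LENGTH:.3e}")
print(f"fraction assayed:   {float(fraction):.3e}")
```

Under these assumptions the assayed fraction is many orders of magnitude below one, which is the point the panel makes.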

**Fig. 2. Aggregation is poorly predicted by existing models and subtly related to amino acid composition.**
(A) The percent composition of residues grouped by their physicochemical properties in aggregators and non-aggregators. (B) The hydrophobicity and β sheet propensity of assayed sequences relative to known human amyloids (table S3) and the human proteome. (C) The predictive power (AUROC ±95% CI) of previous amyloid predictors on the random sequences. (D and E) The position-specific differences in amino acid frequencies across aggregating and non-aggregating sequences. Asterisks indicate marginal P values (chi-square test) below 0.05 (“*”), 0.01 (“**”), and 0.001 (“***”).
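
Panel (C) reports AUROC with a 95% CI. A minimal sketch of how such a number can be computed is given below. It uses the rank-sum (Mann-Whitney) identity for AUROC and a percentile bootstrap for the interval. The bootstrap is an assumption here, since this record does not say how the intervals were obtained, and the labels and scores are made-up toy values.

```python
import random


def auroc(labels, scores):
    """AUROC via the rank-sum identity; tied scores receive average ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative")
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUROC (an assumed choice of CI method)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain a single class
            stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]


# Toy data: 1 = aggregator, 0 = non-aggregator; scores are hypothetical predictions.
labels = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.75, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2]
low, high = bootstrap_ci(labels, scores)
print(f"AUROC = {auroc(labels, scores):.2f} (95% CI {low:.2f} to {high:.2f})")
```

An AUROC of 0.50 corresponds to random ranking, which is the reference point for judging the predictors in the panel.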

**Fig. 3. CANYA.**
(A) CANYA is a three-layer neural network with 17,491 parameters. The model contains 100 filters, a single attention head with key length 6, a dense layer with 64 nodes, and finally a sigmoid output layer. (B to D) Evaluation metrics across the top 50 performing (of 100) model fits of CANYA. (B) The AUROC for held-out testing sequences. (C) The AUPR for held-out testing sequences. (D) The interpretability score (KL divergence; Methods) calculated on all held-out test sequences plotted against the mean AUPR across experiments. See fig. S4 for results on all 100 model fits.
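
Panels (C) and (D) use AUPR, which is more informative than AUROC when aggregators are a minority class. One common estimator of AUPR is average precision, sketched below on toy values. Whether the authors used this exact estimator is not stated in this record, so treat it as an assumption.

```python
def average_precision(labels, scores):
    """Average precision: mean of the precision values at each true positive,
    scanning predictions from highest to lowest score.
    Tied scores keep their input order (Python's sort is stable)."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("average precision needs at least one positive")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


# Toy data: 1 = aggregator, 0 = non-aggregator; scores are hypothetical predictions.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(f"AUPR (average precision) = {average_precision(toy_labels, toy_scores):.3f}")
```

Unlike AUROC, the chance level for AUPR is the fraction of positives in the test set, so values are only comparable across datasets with similar class balance.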

**Fig. 4. An additional experiment of >7000 random sequences.**
(A) The aggregation rates over an additional validation set of 7040 random sequences. (B) The predictive performance (AUROC ±95% CI) of CANYA and previous methods on the additional dataset.

**Fig. 5. Stable performance of CANYA across diverse prediction tasks.**
The AUROC of CANYA and previous methods across several external datasets. Low-opacity bars represent cases in which the method used data from the testing dataset for training and thus are not valid out-of-sample evaluations. See text for additional descriptions of datasets (Methods and table S1) as well as performance reported as AUPR (fig. S7).

**Fig. 6. CANYA finds physicochemical aggregation motifs.**
The motifs found by CANYA, clustered by their physicochemical properties and GIA effect sizes, and then sorted on the basis of their effect size magnitude. Translucency represents the ratio of cluster effect size compared to the strongest cluster (Methods). The enrichment (in AUROC) of motif cluster presence in secondary structures of resolved amyloids in UniProt (Methods and fig. S8). The dashed lines represent an AUROC of 0.50, and asterisks represent structures for which the enrichment was significantly higher than both 0.50 and the second most-enriched structure.

**Fig. 7. In silico experiments reveal CANYA’s learned aggregation grammar.**
(A) An example of an experiment using GIA, an explainability tool to extract importance (effect sizes) of features in a model. Briefly, model predictions for a background set of sequences are compared to predictions on the same set of sequences with a feature (motif) embedded in them. (B) The distribution of effects from adding one to four copies of a cluster-motif to sequences. Points represent importance. (C) Interaction importance from adding motifs from two clusters to sequences. Warmer colors indicate higher CANYA score than from marginally adding the motifs (and their effects) separately to sequences, whereas cooler colors represent a CANYA score lower than expected from adding marginal motif effects. “X” indicates effects that were not significantly different from 0. (D) The position-dependence of motif effects. Plotted is the percent change of a position-specific effect relative to the motif’s global, position-averaged effect.
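
The comparison described in panels (A) and (C) can be sketched in a few lines. The code below is a toy illustration and not the authors' implementation: `toy_model` is a hypothetical stand-in for a trained predictor, and the motifs, positions, background size, and sequence length are all invented for the example.

```python
import random
from statistics import mean

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("VILFWYM")


def toy_model(seq):
    """Hypothetical stand-in for a trained predictor (NOT CANYA):
    longest run of hydrophobic residues divided by 6, capped at 1."""
    longest = run = 0
    for aa in seq:
        run = run + 1 if aa in HYDROPHOBIC else 0
        longest = max(longest, run)
    return min(1.0, longest / 6)


def embed(seq, motif, position):
    """Overwrite part of `seq` with `motif` at `position`, preserving length."""
    if position < 0 or position + len(motif) > len(seq):
        raise ValueError("motif does not fit at this position")
    return seq[:position] + motif + seq[position + len(motif):]


def global_importance(model, background, motif, position):
    """Mean prediction with the motif embedded minus mean background prediction."""
    with_motif = [embed(s, motif, position) for s in background]
    return mean(model(s) for s in with_motif) - mean(model(s) for s in background)


def interaction(model, background, motif_a, pos_a, motif_b, pos_b):
    """Joint effect of two motifs minus the sum of their marginal effects.
    Zero means the model treats the two motifs additively."""
    both = [embed(embed(s, motif_a, pos_a), motif_b, pos_b) for s in background]
    joint = mean(model(s) for s in both) - mean(model(s) for s in background)
    marginal_a = global_importance(model, background, motif_a, pos_a)
    marginal_b = global_importance(model, background, motif_b, pos_b)
    return joint - marginal_a - marginal_b


# Background set: random sequences of an assumed length of 20 residues.
rng = random.Random(0)
background = ["".join(rng.choice(AMINO_ACIDS) for _ in range(20)) for _ in range(500)]

effect = global_importance(toy_model, background, "VIVIV", position=5)
synergy = interaction(toy_model, background, "VIV", 5, "IVI", 8)
print(f"marginal effect of VIVIV: {effect:+.3f}")
print(f"interaction of VIV and IVI: {synergy:+.3f}")
```

Repeating `global_importance` over every valid `position` and dividing each result by the position-averaged effect would give the kind of position-dependence profile described in panel (D).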

Update of

Massive experimental quantification of amyloid nucleation allows interpretable deep learning of protein aggregation.
Thompson M, Martín M, Olmo TS, Rajesh C, Koo PK, Bolognesi B, Lehner B. bioRxiv [Preprint]. 2024 Oct 1:2024.07.13.603366. doi: 10.1101/2024.07.13.603366. Update in: Sci Adv. 2025 May 2;11(18):eadt5111. doi: 10.1126/sciadv.adt5111. PMID: 39071305.
