Sci Adv. 2025 May 2;11(18):eadt5111.
doi: 10.1126/sciadv.adt5111. Epub 2025 Apr 30.

Massive experimental quantification allows interpretable deep learning of protein aggregation

Mike Thompson et al. Sci Adv. 2025.

Abstract

Protein aggregation is a pathological hallmark of more than 50 human diseases and a major problem for biotechnology. Methods have been proposed to predict aggregation from sequence, but these have been trained and evaluated on small and biased experimental datasets. Here we directly address this data shortage by experimentally quantifying the aggregation of >100,000 protein sequences. This unprecedented dataset reveals the limited performance of existing computational methods and allows us to train CANYA, a convolution-attention hybrid neural network that accurately predicts aggregation from sequence. We adapt genomic neural network interpretability analyses to reveal CANYA's decision-making process and learned grammar. Our results illustrate the power of massive experimental analysis of random sequence-spaces and provide an interpretable and robust neural network model to predict aggregation.

Figures

Fig. 1. Quantifying the aggregation of >100,000 random peptides.
(A) Examples of amyloids in human diseases. (B) The amyloid state is thermodynamically favorable but requires overcoming a kinetic barrier. (C) Experimental design. (D) While we explore more than 110,000 sequences, our dataset is a tiny sample of the possible sequence space. (E) The assayed aggregation scores of sequences labeled “aggregators” and “non-aggregators” in our experiment. (F) An example of a follow-up replication experiment using a synthesized library (NNK3; see fig. S2 for others; data file S3). AA, amino acid.
Fig. 2. Aggregation is poorly predicted by existing models and subtly related to amino acid composition.
(A) The percent composition of residues grouped by their physicochemical properties in aggregators and non-aggregators. (B) The hydrophobicity and β sheet propensity of assayed sequences relative to known human amyloids (table S3) and the human proteome. (C) The predictive power (AUROC ±95% CI) of previous amyloid predictors on the random sequences. (D and E) The position-specific differences in amino acid frequencies across aggregating and non-aggregating sequences. Asterisks indicate marginal P value (chi-square test) lower than 0.05 “*”; lower than 0.01 “**”; lower than 0.001 “***.”
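The AUROC ±95% CI reported in panel (C) can be illustrated with a minimal sketch. The rank-based AUROC below is the standard definition; the percentile bootstrap used for the interval is an assumption for illustration, since the caption does not state how the paper's intervals were computed. The function names `auroc` and `bootstrap_ci` are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Rank-based AUROC: the probability that a randomly chosen positive
    (label 1) scores higher than a randomly chosen negative (label 0).
    Ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_ci(
    scores: Sequence[float],
    labels: Sequence[int],
    n_boot: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the AUROC (assumed method).
    Resamples that contain only one class are discarded and redrawn."""
    rng = random.Random(seed)
    idx = list(range(len(scores)))
    stats: List[float] = []
    while len(stats) < n_boot:
        sample = [rng.choice(idx) for _ in idx]
        ys = [labels[i] for i in sample]
        if 0 < sum(ys) < len(ys):  # both classes present
            stats.append(auroc([scores[i] for i in sample], ys))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Here `labels` would mark aggregators as 1 and non-aggregators as 0, and `scores` would be a predictor's output for the same sequences. An AUROC of 0.50 corresponds to random ranking.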
Fig. 3. CANYA.
(A) CANYA is a three-layer neural network with 17,491 parameters. The model contains 100 filters, a single attention head with key length 6, a dense layer with 64 nodes, and finally a sigmoid output layer. (B to D) Evaluation metrics across the top 50 performing (of 100) model fits of CANYA. (B) The AUROC for held-out testing sequences. (C) The AUPR for held-out testing sequences. (D) The interpretability score (KL divergence; Methods) calculated on all held-out test sequences plotted against the mean AUPR across experiments. See fig. S4 for results on all 100 model fits.
Fig. 4. An additional experiment of >7000 random sequences.
(A) The aggregation rates over an additional validation set of 7040 random sequences. (B) The predictive performance (AUROC ±95% CI) of CANYA and previous methods on the additional dataset.
Fig. 5. Stable performance of CANYA across diverse prediction tasks.
The AUROC of CANYA and previous methods across several external datasets. Low-opacity bars represent cases in which the method used data from the testing dataset for training and thus are not valid out-of-sample evaluations. See text for additional descriptions of datasets (Methods and table S1) as well as performance reported as AUPR (fig. S7).
Fig. 6. CANYA finds physicochemical aggregation motifs.
The motifs found by CANYA, clustered by their physicochemical properties and GIA effect sizes, and then sorted on the basis of their effect size magnitude. Translucency represents the ratio of cluster effect size compared to the strongest cluster (Methods). The enrichment (in AUROC) of motif cluster presence in secondary structures of resolved amyloids in UniProt (Methods and fig. S8). The dashed lines represent an AUROC of 0.50, and asterisks represent structures for which the enrichment was significantly higher than both 0.50 and the second most-enriched structure.
Fig. 7. In silico experiments reveal CANYA’s learned aggregation grammar.
(A) An example of an experiment using GIA, an explainability tool to extract importance (effect sizes) of features in a model. Briefly, model predictions for a background set of sequences are compared to predictions on the same set of sequences with a feature (motif) embedded in them. (B) The distribution of effects from adding one to four copies of a cluster-motif to sequences. Points represent importance. (C) Interaction importance from adding motifs from two clusters to sequences. Warmer colors indicate higher CANYA score than from marginally adding the motifs (and their effects) separately to sequences, whereas cooler colors represent a CANYA score lower than expected from adding marginal motif effects. “X” indicates effects that were not significantly different from 0. (D) The position-dependence of motif effects. Plotted is the percent change of a position-specific effect relative to the motif’s global, position-averaged effect.
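The GIA procedure described in Fig. 7A (compare model predictions on a background set with predictions on the same set after embedding a motif) can be sketched as follows. This is an illustrative sketch only, not the paper's implementation: `toy_model` is a hypothetical stand-in for CANYA that scores the fraction of hydrophobic residues, the effect size is taken as a simple difference of mean predictions, and the helper names, background size, and sequence length are all assumptions.

```python
import random
from typing import Callable, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_background(n: int, length: int, seed: int = 0) -> List[str]:
    """Draw n uniformly random peptide sequences of a fixed length."""
    rng = random.Random(seed)
    return ["".join(rng.choice(AMINO_ACIDS) for _ in range(length)) for _ in range(n)]


def embed(seq: str, motif: str, pos: int) -> str:
    """Overwrite seq with motif starting at pos, keeping the length fixed."""
    if pos < 0 or pos + len(motif) > len(seq):
        raise ValueError("motif does not fit at this position")
    return seq[:pos] + motif + seq[pos + len(motif):]


def global_importance(
    model: Callable[[str], float],
    background: List[str],
    motif: str,
    pos: int,
) -> float:
    """Mean prediction with the motif embedded minus mean prediction on
    the unmodified background (assumed definition of the effect size)."""
    base = sum(model(s) for s in background) / len(background)
    with_motif = sum(model(embed(s, motif, pos)) for s in background) / len(background)
    return with_motif - base


def toy_model(seq: str) -> float:
    """Hypothetical stand-in for a trained predictor: the fraction of
    hydrophobic residues in the sequence."""
    hydrophobic = set("AVILMFWY")
    return sum(c in hydrophobic for c in seq) / len(seq)


background = random_background(n=200, length=20)
effect_hydrophobic = global_importance(toy_model, background, "VIV", pos=5)
effect_charged = global_importance(toy_model, background, "DKE", pos=5)
```

Under this toy model, a hydrophobic motif yields a positive effect and a charged motif a negative one. Repeating the calculation over every valid `pos` gives position-specific effects of the kind plotted in panel (D), and embedding two motifs at once and subtracting their separate effects gives an interaction term of the kind shown in panel (C).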
