Harnessing deep learning for proteome-scale detection of amyloid signaling motifs

Krzysztof Pysz et al. Bioinformatics. 2025 Jul 1;41(Supplement_1):i420-i428. doi: 10.1093/bioinformatics/btaf200.
Abstract

Motivation: Amyloid signaling sequences adopt the cross-β fold, which is capable of self-replication through a templating process. Propagation of the amyloid fold from a receptor to an effector protein is used for signal transduction in immune response pathways in animals, fungi, and bacteria. So far, a dozen families of amyloid signaling motifs (ASMs) have been classified. Unfortunately, the wide variety of ASMs makes them difficult to identify in the large protein databases available, which limits the possibility of conducting experimental studies. To date, various deep learning (DL) models have been applied across a range of protein-related tasks, including domain family classification and the prediction of protein structure and protein-protein interactions.

Results: In this study, we develop tailor-made bidirectional LSTM (BiLSTM) and BERT-based architectures to model ASMs, and compare their performance against a state-of-the-art machine-learning grammatical model. Our research focuses on developing a discriminative model of generalized ASMs, capable of detecting ASMs in large datasets. The DL-based models are trained on a diverse set of motif families and a global negative set, and are then used to identify ASMs from remotely related families. We analyze how both models represent the data and demonstrate that the DL-based approaches effectively detect ASMs, including novel motifs, even at the genome scale.

Availability and implementation: The models are provided as a Python package, asmscan-bilstm, and a Docker image at https://github.com/chrispysz/asmscan-proteinbert-run. The source code can be accessed at https://github.com/jakub-galazka/asmscan-bilstm and https://github.com/chrispysz/asmscan-proteinbert. Data and results are at https://github.com/wdyrka-pwr/ASMscan.


Figures

Figure 1.
Sequence diversity versus average sequence length in UniProt-based full alignments for Pfam profiles (Mistry et al. 2021). The vertical line corresponds to an average length of 40 amino acids.
Figure 2.
The bidirectional LSTM model architecture consists of two hidden layers. It processes batches of short protein input sequences of length 40, beginning with an embedding layer that maps inputs to dense representations of size 8. The first hidden layer is a bidirectional LSTM, where each direction has 8 units, for a total of 16 units; it uses tanh activation to capture contextual dependencies from both directions of the sequence. The second hidden layer is a unidirectional LSTM with 4 units and tanh activation, further refining the sequence representation. The output layer is a single dense unit with sigmoid activation, producing the probability that an ASM is present in the sequence. Dropout layers (10%) are applied between the embedding, bidirectional LSTM, and unidirectional LSTM layers to regularize the model and prevent overfitting.
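The architecture described in the caption can be sketched in Keras. This is a reconstruction from the caption alone, not the authors' code: the vocabulary size, the presence of a dropout layer after the unidirectional LSTM, and all training details are assumptions.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE = 25  # assumption: 20 amino acids plus a few special tokens

model = models.Sequential([
    layers.Embedding(input_dim=VOCAB_SIZE, output_dim=8),   # dense representation of size 8
    layers.Dropout(0.10),
    layers.Bidirectional(layers.LSTM(8, activation="tanh",  # 8 units per direction, 16 in total
                                     return_sequences=True)),
    layers.Dropout(0.10),
    layers.LSTM(4, activation="tanh"),                      # unidirectional refinement
    layers.Dropout(0.10),                                   # assumed "final dropout layer"
    layers.Dense(1, activation="sigmoid"),                  # probability of an ASM in the fragment
])

# One batch of two integer-encoded fragments of length 40:
probs = model(np.zeros((2, 40), dtype="int32"))             # shape (2, 1)
```

The model accepts integer-encoded 40-residue fragments and outputs one probability per fragment; any amino-acid-to-integer encoding scheme consistent with VOCAB_SIZE would do.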
Figure 3.
Recall of the combined models versus the FPR on the negative test set sourced from PB40. Positive test samples are (A) BASS_N, (B) FASS_N, and (C) FASS_C. Data points are connected with smooth solid or dotted lines only to improve readability.
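A recall-versus-FPR curve like the one in Figure 3 is produced by sweeping a score threshold: pick the threshold that yields a target false-positive rate on the negative set, then measure recall on the positives at that threshold. A minimal sketch (the scores below are random stand-ins, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(0)
neg_scores = rng.uniform(0.0, 0.6, size=1000)  # model scores on negatives (stand-in for PB40)
pos_scores = rng.uniform(0.3, 1.0, size=200)   # model scores on positives (stand-in, e.g. BASS_N)

def recall_at_fpr(pos, neg, target_fpr):
    """Choose the threshold giving `target_fpr` on negatives; return recall on positives."""
    thr = np.quantile(neg, 1.0 - target_fpr)   # score exceeded by ~target_fpr of negatives
    return float(np.mean(pos > thr))

for fpr in (0.001, 0.01, 0.05):
    print(f"FPR={fpr:.3f}  recall={recall_at_fpr(pos_scores, neg_scores, fpr):.3f}")
```

Recall is non-decreasing in the allowed FPR, since a larger FPR budget lowers the threshold.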
Figure 4.
Accuracy of BASS and FASS motif position determination in N-terminal NLR domains (BASS_Ndom, FASS_Ndom) by the PCFG (A, B) and ProteinBERT (C, D) combined models. Begin (A, C) and end (B, D) shifts were calculated relative to the positions of rigorously cut test motifs (BASS_N, FASS_N). Histograms include shifts for all best hits regardless of their score/probability, except for the single BASS motif that was entirely mispositioned by both methods. Bacterial motifs are shown above the axis, fungal motifs below it.
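The begin/end shift metric behind Figure 4 compares the predicted motif span against the rigorously cut reference motif within each domain. A minimal sketch (the span coordinates below are hypothetical examples, not data from the paper):

```python
# Begin/end shifts of a predicted motif span relative to a reference span.
# A shift of 0 means exact agreement; a negative value means the prediction
# starts/ends earlier than the reference.
def position_shifts(predicted, reference):
    (p_start, p_end), (r_start, r_end) = predicted, reference
    return p_start - r_start, p_end - r_end

# Hypothetical best hit vs. reference motif positions within an N-terminal domain:
print(position_shifts(predicted=(12, 51), reference=(10, 50)))  # (2, 1)
```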
Figure 5.
Embeddings of entire N- and C-terminal domains from the BASS_Ndom, FASS_Ndom, and FASS_Cdom sets, together with the CsgA and NLReff sets, derived from the final dropout layer of the ProteinBERT (A) and BiLSTM (B) models and projected into 2-dimensional space using the UMAP technique. Dots represent sequences in which an ASM was detected, and crosses represent sequences in which no ASM was detected, at the default cut-off of 0.5. The retention plots for the ProteinBERT (C) and BiLSTM (D) models show the percentage of sequences in each set in which an ASM was detected.
Figure 6.
Embeddings of domains of bacterial and fungal origin, divided into ASM families, derived from the final dropout layer of the ProteinBERT (A) and BiLSTM (B) models and projected into 2-dimensional space using the UMAP technique. Dots represent sequences in which an ASM was detected, and crosses represent sequences in which no ASM was detected, at the default cut-off of 0.5. The retention plots for the ProteinBERT (C) and BiLSTM (D) models show the percentage of sequences in each family in which an ASM was detected.

References

    1. Alley EC, Khimulya G, Biswas S et al. Unified rational protein engineering with sequence-based deep representation learning. Nat Methods 2019;16:1315–22. - PMC - PubMed
    2. Bileschi ML, Belanger D, Bryant DH et al. Using deep learning to annotate the protein universe. Nat Biotechnol 2022;40:932–7. - PubMed
    3. Booth TL. Probabilistic representation of formal languages. In: Proceedings of the 10th Annual Symposium on Switching and Automata Theory. 1969, 74–81. New York, NY, USA: IEEE.
    4. Booth TL, Thompson RA. Applying probability measures to abstract languages. IEEE Trans Comput 1973;C-22:442–50.
    5. Brandes N, Ofer D, Peleg Y et al. ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics 2022;38:2102–10. - PMC - PubMed
