Bioinformatics. 2024 Dec 26;41(1):btaf016.

doi: 10.1093/bioinformatics/btaf016.

PHIStruct: improving phage-host interaction prediction at low sequence similarity settings using structure-aware protein embeddings

Mark Edward M Gonzales^{1,2}, Jennifer C Ureta^{1,2,3}, Anish M S Shrestha^{1,2}

Affiliations

¹ Bioinformatics Lab, Advanced Research Institute for Informatics, Computing and Networking, De La Salle University, Manila 1004, Philippines.
² College of Computer Studies, De La Salle University, Manila 1004, Philippines.
³ Walter and Eliza Hall Institute of Medical Research, Melbourne, VIC 3052, Australia.

PMID: 39804673
PMCID: PMC11783280
DOI: 10.1093/bioinformatics/btaf016

Mark Edward M Gonzales et al. Bioinformatics. 2024.


Abstract

Motivation: Recent computational approaches for predicting phage-host interaction have explored the use of sequence-only protein language models to produce embeddings of phage proteins without manual feature engineering. However, these embeddings do not directly capture protein structure information and structure-informed signals related to host specificity.

Results: We present PHIStruct, a multilayer perceptron that takes in structure-aware embeddings of receptor-binding proteins, generated via the structure-aware protein language model SaProt, and then predicts the host from among the ESKAPEE genera. Compared against recent tools, PHIStruct exhibits the best balance of precision and recall, with the highest and most stable F1 score across a wide range of confidence thresholds and sequence similarity settings. The margin in performance is most pronounced when the sequence similarity between the training and test sets drops below 40%, wherein, at a relatively high-confidence threshold of above 50%, PHIStruct presents a 7%-9% increase in class-averaged F1 over machine learning tools that do not directly incorporate structure information, as well as a 5%-6% increase over BLASTp.

Availability and implementation: The data and source code for our experiments and analyses are available at https://github.com/bioinfodlsu/PHIStruct.

Figures

**Figure 1.**
Methodology. Step 1: We collected phage genome and protein sequences from GenBank (Benson *et al.* 2007) using INPHARED (Cook *et al.* 2021). Step 2: Receptor-binding proteins (RBPs) were identified following the methodology in our previous work (Gonzales *et al.* 2023). Step 3: We fed the RBP sequences to ColabFold (Mirdita *et al.* 2022) to predict their structures. Step 4: The proteins, alongside their predicted structures, were fed to SaProt. For a protein of length $n$, the input to SaProt is $\langle (r_{1}, f_{1}), (r_{2}, f_{2}), \dots, (r_{n}, f_{n}) \rangle$, where $\langle r_{1}, r_{2}, \dots, r_{n} \rangle$ is the sequence representation and $\langle f_{1}, f_{2}, \dots, f_{n} \rangle$ is the structure representation from Foldseek (van Kempen *et al.* 2024). SaProt outputs the structure-aware vector representations (embeddings). Step 5: In constructing our training and test sets, we partitioned our dataset with respect to different train-versus-test sequence similarity thresholds via CD-HIT (Fu *et al.* 2012). Step 6: We built a two-hidden-layer perceptron that takes in the SaProt embedding of an RBP as input and outputs the host genus from among the ESKAPEE genera. Step 7: We evaluated our model’s performance. Icon sources: Bacteriophage: https://static.thenounproject.com/png/1372464-200.png; Deep learning: https://static.thenounproject.com/png/2424485-200.png; Isolated icon of a neural network. concept of artificial intelligence, deep learning and machine learning: https://t4.ftcdn.net/jpg/04/30/22/13/360_F_430221349_N1HJUZArv5f4dhmzOYUzuCpxGQZ5rTO5.jpg; Percentage free icon: https://cdn-icons-png.flaticon.com/512/156/156877.png; Protein structure flat simple icon: https://t4.ftcdn.net/jpg/04/30/22/13/360_F_430221349_N1HJUZArv5f4dhmzOYUzuCpxGQZ5rTO5.jpg.
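
The residue-structure pairing in Step 4 can be sketched as follows. This is a minimal illustration and not the authors' code: the helper name `build_structure_aware_tokens` and the toy strings are hypothetical, and the token convention (an uppercase residue letter followed by a lowercase Foldseek structure letter, with `#` standing in for a masked structure token) is assumed from SaProt's published input format.

```python
from typing import List, Optional, Set


def build_structure_aware_tokens(
    residues: str,
    foldseek_states: str,
    mask_structure_at: Optional[Set[int]] = None,
) -> List[str]:
    """Pair each residue r_i with its structure token f_i.

    Assumption: a token is written as residue (uppercase) + structure
    letter (lowercase); '#' replaces the structure letter when masked.
    """
    if len(residues) != len(foldseek_states):
        raise ValueError("sequence and structure strings must have equal length")
    masked = mask_structure_at or set()
    tokens = []
    for i, (r, f) in enumerate(zip(residues, foldseek_states)):
        f_token = "#" if i in masked else f.lower()
        tokens.append(r.upper() + f_token)
    return tokens


# Toy three-residue protein with made-up structure letters.
toy_tokens = build_structure_aware_tokens("MKV", "DPV")
toy_input = "".join(toy_tokens)

# Same protein with the structure token at position 1 masked.
toy_masked = build_structure_aware_tokens("MKV", "DPV", mask_structure_at={1})
```

Keeping the pairing in one small function makes it easy to compare masking strategies by varying only `mask_structure_at`.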

**Figure 2.**
PHIStruct classifier architecture. The number below the label of each layer denotes the size of that layer.
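
A forward pass through a two-hidden-layer perceptron of this general shape might look like the sketch below. The layer sizes, the ReLU activation, the random weights, and all function names are placeholders chosen for illustration; the actual sizes are those given in the figure, which are not reproduced here. Only the seven-way output over the ESKAPEE genera follows from the text.

```python
import math
import random
from typing import List

ESKAPEE_GENERA = [
    "Enterococcus", "Staphylococcus", "Klebsiella", "Acinetobacter",
    "Pseudomonas", "Enterobacter", "Escherichia",
]

Layer = List[List[float]]  # one row per output unit: weights followed by a bias


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Create a small randomly initialised dense layer (untrained)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)] for _ in range(n_out)]


def dense(x: List[float], layer: Layer) -> List[float]:
    """Affine transform: weights . x + bias for each output unit."""
    return [sum(w * v for w, v in zip(row[:-1], x)) + row[-1] for row in layer]


def relu(x: List[float]) -> List[float]:
    return [max(0.0, v) for v in x]


def softmax(z: List[float]) -> List[float]:
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(embedding: List[float], layers: List[Layer]) -> List[float]:
    """Hidden layers use ReLU (an assumption); the output layer uses softmax."""
    h = embedding
    for layer in layers[:-1]:
        h = relu(dense(h, layer))
    return softmax(dense(h, layers[-1]))


rng = random.Random(0)
EMBED_DIM, HIDDEN_1, HIDDEN_2 = 16, 8, 4  # hypothetical toy sizes
layers = [
    init_layer(EMBED_DIM, HIDDEN_1, rng),
    init_layer(HIDDEN_1, HIDDEN_2, rng),
    init_layer(HIDDEN_2, len(ESKAPEE_GENERA), rng),
]
embedding = [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]  # stand-in for a SaProt embedding
probs = predict_proba(embedding, layers)
predicted_genus = ESKAPEE_GENERA[max(range(len(probs)), key=lambda i: probs[i])]
```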

**Figure 3.**
Comparison of the performance of PHIStruct with state-of-the-art machine learning and sequence alignment-based tools that map receptor-binding proteins to host bacteria. The maximum train-versus-test sequence similarity is set to $s = 40\%$. Performance is measured in terms of class-averaged (macro) metrics. (a) Precision–recall curves. The label of each point denotes the confidence threshold k (%) at which the performance was measured. (b) F1 scores. Higher values of k prioritize precision over recall, whereas lower values prioritize recall.
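
The interplay between the confidence threshold k and the class-averaged metrics can be illustrated with a small sketch. It rests on assumptions that are not stated in this caption: a prediction whose top probability falls below k is treated as an abstention (`None`), the comparison is non-strict (`>=`), and macro F1 is the unweighted mean of per-class F1 scores. The function names and toy data are hypothetical.

```python
from typing import Dict, List, Optional, Sequence


def predict_with_threshold(
    probs: Sequence[float], labels: Sequence[str], k: float
) -> Optional[str]:
    """Return the top label if its probability is at least k, else abstain."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best] if probs[best] >= k else None


def macro_metrics(
    y_true: Sequence[str], y_pred: Sequence[Optional[str]], labels: Sequence[str]
) -> Dict[str, float]:
    """Unweighted mean of per-class precision, recall, and F1."""
    precisions: List[float] = []
    recalls: List[float] = []
    f1s: List[float] = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)  # abstentions count as misses
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {"precision": sum(precisions) / n, "recall": sum(recalls) / n, "f1": sum(f1s) / n}


labels = ["Klebsiella", "Escherichia"]
toy_probs = [[0.90, 0.10], [0.55, 0.45], [0.30, 0.70], [0.48, 0.52]]
y_true = ["Klebsiella", "Escherichia", "Escherichia", "Klebsiella"]

low_k = macro_metrics(y_true, [predict_with_threshold(p, labels, 0.0) for p in toy_probs], labels)
high_k = macro_metrics(y_true, [predict_with_threshold(p, labels, 0.6) for p in toy_probs], labels)
```

On this toy data, raising k from 0 to 0.6 discards the two low-confidence (and wrong) predictions, so macro precision rises while macro recall stays the same.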

**Figure 4.**
Visualization of the SaProt embeddings using uniform manifold approximation and projection (UMAP). We projected the top 25% SaProt embedding components with the highest importance based on Shapley additive explanations.

**Figure 5.**
Comparison of the performance of different masking strategies for inputting proteins to SaProt. The maximum train-versus-test sequence similarity is set to $s = 40\%$. Performance is measured in terms of class-averaged (macro) metrics. (a) Precision–recall curves. The label of each point denotes the confidence threshold k (%) at which the performance was measured. (b) F1 scores. Higher values of k prioritize precision over recall, whereas lower values prioritize recall.

**Figure 6.**
Comparison of the performance of PHIStruct with same-architecture multilayer perceptron models that take in sequence-only embeddings. The maximum train-versus-test sequence similarity is set to $s = 40\%$. Performance is measured in terms of class-averaged (macro) metrics. (a) Precision–recall curves. The label of each point denotes the confidence threshold k (%) at which the performance was measured. (b) F1 scores. Higher values of k prioritize precision over recall, whereas lower values prioritize recall.

**Figure 7.**
Comparison of the performance of PHIStruct with same-architecture multilayer perceptron models that take in structure-aware protein embeddings other than SaProt. The maximum train-versus-test sequence similarity is set to $s = 40\%$. Performance is measured in terms of class-averaged (macro) metrics. (a) Precision–recall curves. The label of each point denotes the confidence threshold k (%) at which the performance was measured. (b) F1 scores. Higher values of k prioritize precision over recall, whereas lower values prioritize recall.

**Figure 8.**
Comparison of the performance of PHIStruct with other downstream classifiers that take in the same SaProt embeddings. The maximum train-versus-test sequence similarity is set to $s = 40\%$. Performance is measured in terms of class-averaged (macro) metrics. (a) Precision–recall curves. The label of each point denotes the confidence threshold k (%) at which the performance was measured. (b) F1 scores. Higher values of k prioritize precision over recall, whereas lower values prioritize recall.

**Figure 9.**
Confusion matrices at maximum train-versus-test sequence similarity $s = 40\%$. (a) Confusion matrix at confidence threshold $k = 0\%$, normalized over the true class labels. The main diagonal reflects the per-class recall. (b) Confusion matrix at $k = 90\%$, normalized over the predicted class labels. The main diagonal reflects the per-class precision. Lower values of k prioritize recall over precision, whereas higher values prioritize precision.
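
The two normalizations can be sketched as below: dividing each row (true class) by its total puts recall on the diagonal, and dividing each column (predicted class) by its total puts precision there. The function names and toy labels are hypothetical, and skipping abstained predictions (`None`) is an assumption about how below-threshold cases are handled.

```python
from typing import List, Optional, Sequence


def confusion_matrix(
    y_true: Sequence[str], y_pred: Sequence[Optional[str]], labels: Sequence[str]
) -> List[List[int]]:
    """Rows index the true class, columns the predicted class."""
    index = {c: i for i, c in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        if p in index:  # abstentions (None) are not counted in any column
            matrix[index[t]][index[p]] += 1
    return matrix


def normalize_over_true(matrix: List[List[int]]) -> List[List[float]]:
    """Divide each row by its sum; the diagonal is then per-class recall
    (when every sample received a prediction)."""
    out = []
    for row in matrix:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out


def normalize_over_predicted(matrix: List[List[int]]) -> List[List[float]]:
    """Divide each column by its sum; the diagonal is then per-class precision."""
    n = len(matrix)
    col_totals = [sum(matrix[r][c] for r in range(n)) for c in range(n)]
    return [
        [matrix[r][c] / col_totals[c] if col_totals[c] else 0.0 for c in range(n)]
        for r in range(n)
    ]


labels = ["Klebsiella", "Escherichia"]
y_true = ["Klebsiella"] * 4 + ["Escherichia"] * 2
y_pred = ["Klebsiella", "Klebsiella", "Escherichia", "Escherichia", "Escherichia", "Escherichia"]

counts = confusion_matrix(y_true, y_pred, labels)
by_true = normalize_over_true(counts)
by_predicted = normalize_over_predicted(counts)
```

In this toy case the same counts give a diagonal of 0.5 and 1.0 under true-label normalization but 1.0 and 0.5 under predicted-label normalization, which is why the two panels must be read differently.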

Grants and funding

Department of Science and Technology Philippine Council for Health Research and Development
