Bioinformatics. 2024 Dec 26;41(1):btaf016.
doi: 10.1093/bioinformatics/btaf016.

PHIStruct: improving phage-host interaction prediction at low sequence similarity settings using structure-aware protein embeddings

Mark Edward M Gonzales et al. Bioinformatics.

Abstract

Motivation: Recent computational approaches for predicting phage-host interaction have explored the use of sequence-only protein language models to produce embeddings of phage proteins without manual feature engineering. However, these embeddings do not directly capture protein structure information and structure-informed signals related to host specificity.

Results: We present PHIStruct, a multilayer perceptron that takes in structure-aware embeddings of receptor-binding proteins, generated via the structure-aware protein language model SaProt, and then predicts the host from among the ESKAPEE genera. Compared against recent tools, PHIStruct exhibits the best balance of precision and recall, with the highest and most stable F1 score across a wide range of confidence thresholds and sequence similarity settings. The margin in performance is most pronounced when the sequence similarity between the training and test sets drops below 40%, wherein, at a relatively high-confidence threshold of above 50%, PHIStruct presents a 7%-9% increase in class-averaged F1 over machine learning tools that do not directly incorporate structure information, as well as a 5%-6% increase over BLASTp.

Availability and implementation: The data and source code for our experiments and analyses are available at https://github.com/bioinfodlsu/PHIStruct.

Figures

Figure 1.
Methodology. Step 1: We collected phage genome and protein sequences from GenBank (Benson et al. 2007) using INPHARED (Cook et al. 2021). Step 2: Receptor-binding proteins (RBPs) were identified following the methodology in our previous work (Gonzales et al. 2023). Step 3: We fed the RBP sequences to ColabFold (Mirdita et al. 2022) to predict their structures. Step 4: The proteins, alongside their predicted structures, were fed to SaProt. For a protein of length $n$, the input to SaProt is $(r_1,f_1),(r_2,f_2),\ldots,(r_n,f_n)$, where $r_1,r_2,\ldots,r_n$ is the sequence representation and $f_1,f_2,\ldots,f_n$ is the structure representation from Foldseek (van Kempen et al. 2024). SaProt outputs the structure-aware vector representations (embeddings). Step 5: In constructing our training and test sets, we partitioned our dataset with respect to different train-versus-test sequence similarity thresholds via CD-HIT (Fu et al. 2012). Step 6: We built a two-hidden-layer perceptron that takes in the SaProt embedding of an RBP as input and outputs the host genus from among the ESKAPEE genera. Step 7: We evaluated our model’s performance. Icon sources: Bacteriophage: https://static.thenounproject.com/png/1372464-200.png; Deep learning: https://static.thenounproject.com/png/2424485-200.png; Isolated icon of a neural network. concept of artificial intelligence, deep learning and machine learning: https://t4.ftcdn.net/jpg/04/30/22/13/360_F_430221349_N1HJUZArv5f4dhmzOYUzuCpxGQZ5rTO5.jpg; Percentage free icon: https://cdn-icons-png.flaticon.com/512/156/156877.png; Protein structure flat simple icon: https://t4.ftcdn.net/jpg/04/30/22/13/360_F_430221349_N1HJUZArv5f4dhmzOYUzuCpxGQZ5rTO5.jpg.
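The pairing of residues with structure states in Step 4 can be illustrated with a minimal sketch. It assumes that each residue is written as an uppercase amino-acid letter and each Foldseek structure state as a lowercase letter, and that the two are concatenated position by position into one token string; this convention and the function names below are assumptions for illustration, not details stated in the caption.

```python
def pair_sequence_with_structure(residues: str, structure_states: str) -> list[tuple[str, str]]:
    """Return the pairs (r_i, f_i) for a protein of length n.

    Assumption: residues are uppercase amino-acid letters and structure
    states are lowercase letters, one state per residue.
    """
    if len(residues) != len(structure_states):
        raise ValueError("sequence and structure must have the same length")
    return list(zip(residues.upper(), structure_states.lower()))


def to_token_string(pairs: list[tuple[str, str]]) -> str:
    """Concatenate each (residue, state) pair into one structure-aware token."""
    return "".join(residue + state for residue, state in pairs)


# Toy three-residue example; the letters are made up, not taken from a real protein.
pairs = pair_sequence_with_structure("MEV", "dvp")
token_string = to_token_string(pairs)
print(pairs)
print(token_string)
```

The length check matters because a structure string that is misaligned with the sequence would silently attach the wrong structure state to every downstream residue.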
Figure 2.
PHIStruct classifier architecture. The number below the label of each layer denotes the size of that layer.
Figure 3.
Comparison of the performance of PHIStruct with state-of-the-art machine learning and sequence alignment-based tools that map receptor-binding proteins to host bacteria. The maximum train-versus-test sequence similarity is set to s=40%. Performance is measured in terms of class-averaged (macro) metrics. (a) Precision–recall curves. The label of each point denotes the confidence threshold k (%) at which the performance was measured. (b) F1 scores. Higher values of k prioritize precision over recall, whereas lower values prioritize recall.
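The trade-off governed by the confidence threshold k can be made concrete with a small sketch. It assumes that a prediction is kept only when the highest class probability is at least k and is otherwise treated as an abstention (represented here by `None`), and that macro F1 is the unweighted mean of the per-class F1 scores. Both conventions, along with the function names and the toy data, are assumptions for illustration and may differ from the paper's exact evaluation protocol.

```python
from typing import Optional


def apply_threshold(probabilities: list[list[float]], classes: list[str], k: float) -> list[Optional[str]]:
    """Keep the top class only if its probability is at least k (k is a fraction in [0, 1])."""
    predictions: list[Optional[str]] = []
    for row in probabilities:
        best = max(range(len(classes)), key=lambda i: row[i])
        predictions.append(classes[best] if row[best] >= k else None)
    return predictions


def macro_scores(y_true: list[str], y_pred: list[Optional[str]], classes: list[str]) -> tuple[float, float, float]:
    """Return class-averaged (macro) precision, recall, and F1."""
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))  # abstentions count as misses
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy data: two genera, four proteins, made-up probabilities.
classes = ["Escherichia", "Klebsiella"]
y_true = ["Escherichia", "Escherichia", "Klebsiella", "Klebsiella"]
probabilities = [[0.9, 0.1], [0.6, 0.4], [0.55, 0.45], [0.2, 0.8]]

low_k = macro_scores(y_true, apply_threshold(probabilities, classes, 0.0), classes)
high_k = macro_scores(y_true, apply_threshold(probabilities, classes, 0.7), classes)
print("k=0%: ", low_k)
print("k=70%:", high_k)
```

On this toy data, raising k from 0% to 70% lifts macro precision while lowering macro recall, which matches the direction of the trade-off described in the caption.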
Figure 4.
Visualization of the SaProt embeddings using uniform manifold approximation and projection (UMAP). We projected the top 25% SaProt embedding components with the highest importance based on Shapley additive explanations.
Figure 5.
Comparison of the performance of different masking strategies for inputting proteins to SaProt. The maximum train-versus-test sequence similarity is set to s=40%. Performance is measured in terms of class-averaged (macro) metrics. (a) Precision–recall curves. The label of each point denotes the confidence threshold k (%) at which the performance was measured. (b) F1 scores. Higher values of k prioritize precision over recall, whereas lower values prioritize recall.
Figure 6.
Comparison of the performance of PHIStruct with same-architecture multilayer perceptron models that take in sequence-only embeddings. The maximum train-versus-test sequence similarity is set to s=40%. Performance is measured in terms of class-averaged (macro) metrics. (a) Precision–recall curves. The label of each point denotes the confidence threshold k (%) at which the performance was measured. (b) F1 scores. Higher values of k prioritize precision over recall, whereas lower values prioritize recall.
Figure 7.
Comparison of the performance of PHIStruct with same-architecture multilayer perceptron models that take in structure-aware protein embeddings other than SaProt. The maximum train-versus-test sequence similarity is set to s=40%. Performance is measured in terms of class-averaged (macro) metrics. (a) Precision–recall curves. The label of each point denotes the confidence threshold k (%) at which the performance was measured. (b) F1 scores. Higher values of k prioritize precision over recall, whereas lower values prioritize recall.
Figure 8.
Comparison of the performance of PHIStruct with other downstream classifiers that take in the same SaProt embeddings. The maximum train-versus-test sequence similarity is set to s=40%. Performance is measured in terms of class-averaged (macro) metrics. (a) Precision–recall curves. The label of each point denotes the confidence threshold k (%) at which the performance was measured. (b) F1 scores. Higher values of k prioritize precision over recall, whereas lower values prioritize recall.
Figure 9.
Confusion matrices at maximum train-versus-test sequence similarity s=40%. (a) Confusion matrix at confidence threshold k=0%, normalized over the true class labels. The main diagonal reflects the per-class recall. (b) Confusion matrix at k=90%, normalized over the predicted class labels. The main diagonal reflects the per-class precision. Lower values of k prioritize recall over precision, whereas higher values prioritize precision.
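The two normalizations in the Figure 9 caption can be sketched as follows, to show why the main diagonal reads as recall in one case and precision in the other. The sketch assumes a count matrix whose rows are true classes and whose columns are predicted classes; the counts and function names are made up for illustration, and abstained predictions are assumed to be excluded from the matrix.

```python
def normalize_over_true(counts: list[list[int]]) -> list[list[float]]:
    """Divide each row by its total, so the diagonal is the per-class recall."""
    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([value / total if total else 0.0 for value in row])
    return normalized


def normalize_over_predicted(counts: list[list[int]]) -> list[list[float]]:
    """Divide each column by its total, so the diagonal is the per-class precision."""
    column_totals = [sum(column) for column in zip(*counts)]
    return [
        [value / total if total else 0.0 for value, total in zip(row, column_totals)]
        for row in counts
    ]


# counts[i][j] = number of proteins with true class i that were predicted as class j.
counts = [
    [8, 2],
    [1, 9],
]

by_true = normalize_over_true(counts)
by_predicted = normalize_over_predicted(counts)
print([by_true[i][i] for i in range(2)])       # per-class recall
print([by_predicted[i][i] for i in range(2)])  # per-class precision
```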

References

    1. Antimicrobial resistance surveillance in Europe 2023 - 2021 data. Stockholm: European Centre for Disease Prevention and Control and World Health Organization, 2023.
    2. Apweiler R, Bairoch A, Wu C. et al. UniProt: the universal protein knowledgebase. Nucleic Acids Res 2004;32:D115–9.
    3. Ashworth EA, Wright RCT, Shears RK. et al. Exploiting lung adaptation and phage steering to clear pan-resistant Pseudomonas aeruginosa infections in vivo. Nat Commun 2024;15:1547.
    4. Ayobami O, Brinkwirth S, Eckmanns T. et al. Antibiotic resistance in hospital-acquired ESKAPE-E infections in low- and lower-middle-income countries: a systematic review and meta-analysis. Emerg Microbes Infect 2022;11:443–51.
    5. Badam S, Rao S. Harnessing genome representation learning for decoding phage–host interactions. bioRxiv, 2024, preprint: not peer reviewed.