Brief Bioinform. 2023 Jan 19;24(1):bbac487.
doi: 10.1093/bib/bbac487.

PhaTYP: predicting the lifestyle for bacteriophages using BERT

Jiayu Shang et al. Brief Bioinform.

Abstract

Bacteriophages (or phages), which infect bacteria, have two distinct lifestyles: virulent and temperate. Predicting the lifestyle of phages helps decipher their interactions with their bacterial hosts, aiding the application of phages in fields such as phage therapy. Because experimental methods for annotating phage lifestyles cannot keep pace with the fast accumulation of sequenced phages, computational methods for predicting phage lifestyles have become an attractive alternative. Despite some promising results, computational lifestyle prediction remains difficult because known annotations are limited and the number of sequenced phage contigs assembled from metagenomic data is very large. In particular, most existing tools cannot precisely predict phage lifestyles for short contigs. In this work, we develop PhaTYP (Phage TYPe prediction tool) to improve the accuracy of lifestyle prediction on short contigs. We design two different training tasks, a self-supervised task and a fine-tuning task, to overcome the difficulties of lifestyle prediction. We rigorously tested and compared PhaTYP with four state-of-the-art methods: DeePhage, PHACTS, PhagePred and BACPHLIP. The experimental results show that PhaTYP outperforms all of these methods and achieves more stable performance on short contigs. In addition, we demonstrate the utility of PhaTYP by analyzing phage lifestyles in gut data from human neonates. This application shows that PhaTYP is a useful means of studying phages in metagenomic data and helps extend our understanding of microbial communities.

Keywords: BERT; deep learning; phage lifestyle prediction; virulent and temperate phages.

Figures

Figure 1
Two training tasks for PhaTYP using BERT. (A) The self-supervised learning task. The input is the masked sentence, and the output is the predicted token at the masked position. All phage genomes in the RefSeq database are used to train a Mask LM model. (B) The fine-tuning task for lifestyle prediction. The pretrained model is fine-tuned using phages with known lifestyle annotations. The inputs of the model are protein-based sentences, and the outputs are the probabilities of the two lifestyle classes: virulent and temperate.
Figure 2
Sequence embedding method in PhaTYP. Each block in ‘Protein Sentence’ represents the ID of a protein-based token. $PC_i$: protein cluster $i$. [CLS]: start token. [SEP]: separation token. $E_i$: the embedded vector for protein cluster $PC_i$. ‘+’: vector addition.
Figure 3
The transformer block in PhaTYP. There are three main units in the transformer block: feed-forward network, residual connections and multi-head attention mechanism.
Figure 4
The ROC curve comparison on the complete phage genomes. The value shown in the legend is the AUCROC score. ‘Without SSL task’: training PhaTYP without the self-supervised learning task.
Figure 5
The running time comparison of the five tools in the ten-fold cross-validation experiment. All methods were run on an Intel® Xeon® Gold 6258R CPU and a 2080Ti GPU.
Figure 6
The ROC curve comparison on the low-similarity test set. The value shown in the legend is the AUCROC score. ‘Without SSL task’: training PhaTYP without the self-supervised learning task.
Figure 7
The ROC curve comparison on the short contigs. The value shown in the legend is the AUCROC score. ‘Without SSL task’: training PhaTYP without the self-supervised learning task. PhaTYP has the best performance.
Figure 8
The classification results on the 33 virulent crAssphages. PhaTYP can correctly predict all the crAssphages as being virulent.
Figure 9
The classification results on the contigs that have homologous regions with integrase proteins. Both PhaTYP and DeePhage can correctly predict all the contigs as being temperate.
Figure 10
Violin plots at different months. (A) Predictions on all 2291 phages. (B) Predictions on newly colonized phages. The Y-axis represents the percentage of temperate phages in each sample.
Figure 11
Violin plots for different delivery types: C-section with labor (CS(w/)L), C-section without labor (CS(w/o)L) and spontaneous vaginal delivery (SVD). To control for age, the samples are grouped by age. The Y-axis represents the percentage of temperate phages in each sample.
Figure 12
Violin plots for different feeding types. Formula: infants were fed with formula milk. Mixed: infants were fed with both formula and breast milk. To control for age, the samples are grouped by age. The Y-axis represents the percentage of temperate phages in each sample.
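
The masked-sentence input described for Figures 1 and 2 can be illustrated with a minimal sketch. This is not PhaTYP's code: the function names, the `PC_...` token spellings, the `[MASK]` token and the 15% masking rate (the value commonly used for BERT-style pretraining) are all assumptions made for illustration. It shows only how a protein-based sentence might be wrapped in [CLS] and [SEP] and how masked positions and their target tokens could be produced for a Mask LM task.

```python
import random

CLS, SEP, MASK = "[CLS]", "[SEP]", "[MASK]"  # [MASK] spelling is an assumption


def build_sentence(protein_clusters):
    """Wrap protein-cluster tokens with the start and separation tokens."""
    return [CLS, *protein_clusters, SEP]


def mask_sentence(sentence, mask_prob=0.15, rng=None):
    """Return (masked_sentence, labels) for a masked-token prediction task.

    labels maps each masked position to the original token the model
    should predict. mask_prob=0.15 is an assumed, BERT-style default.
    """
    if rng is None:
        rng = random.Random(0)
    masked = list(sentence)
    labels = {}
    # Special tokens are never masked.
    candidates = [i for i, tok in enumerate(sentence) if tok not in (CLS, SEP)]
    for i in candidates:
        if rng.random() < mask_prob:
            labels[i] = sentence[i]
            masked[i] = MASK
    # Guarantee at least one training target for short sentences.
    if not labels and candidates:
        i = rng.choice(candidates)
        labels[i] = sentence[i]
        masked[i] = MASK
    return masked, labels


if __name__ == "__main__":
    # Hypothetical protein-cluster IDs for one phage contig.
    sentence = build_sentence(["PC_12", "PC_7", "PC_301", "PC_7"])
    masked, labels = mask_sentence(sentence)
    print(masked)
    print(labels)
```

In the fine-tuning task, the same kind of sentence would be passed unmasked to the pretrained model, which outputs the probabilities of the virulent and temperate classes.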
