Genome Biol. 2025 Aug 20;26(1):254.
doi: 10.1186/s13059-025-03733-0.

Highly accurate prophage island detection with PIDE

Hongyan Gao et al. Genome Biol.

Abstract

As important mobile elements in prokaryotes, prophages shape the genomic context of their hosts and regulate the structure of bacterial populations. However, it is challenging to precisely identify prophages through computational methods. Here, we introduce PIDE for identifying prophages from bacterial genomes or metagenome-assembled genomes. PIDE integrates a pre-trained protein language model and a gene density clustering algorithm to distinguish prophages. Benchmarking with induced prophage sequencing datasets demonstrates that PIDE pinpoints prophages with precise boundaries. Applying PIDE to 4744 human gut representative genomes reveals 24,467 prophages with widespread functional capacity. PIDE is available at https://github.com/chyghy/PIDE, with model training code at https://zenodo.org/records/16457629.

Keywords: Gene cluster; Human gut metagenome; Prophage identification; Protein language model.

Conflict of interest statement

Declarations. Ethics approval and consent to participate: Not applicable. Consent for publication: Not applicable. Competing interests: The authors declare no competing interests.

Figures

Fig. 1
The PIDE framework for identifying prophage islands (PIs). a Schematic overview of PIDE. PIDE takes DNA sequences as input and outputs the predicted PI regions. b The detailed strategy employed by PIDE for clustering prophage ORFs and identifying prophage regions. D represents the distance, and it can vary depending on different requirements. c The performance of PIDE and the effect of ablation analysis. FT, fine-tuning. d The distribution of PIDE-predicted PIs across phage genomes under different D. The x-axis represents the number of PIs predicted by PIDE for each complete phage genome, and the y-axis indicates the number of complete phage genomes. e The x-axis represents different D values, and the y-axis indicates the proportion of complete phages that can be recovered by PIDE.
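The clustering step in panel b is only summarized in the caption, so the following is a minimal sketch of one way such a step could work, not PIDE's actual implementation. It assumes each ORF already carries a phage/non-phage label (for example, from a protein language model classifier), merges phage-labeled ORFs whose gap is at most D, and keeps clusters with a minimum number of ORFs. The `Orf` class, the `cluster_phage_orfs` function, and the `min_orfs` parameter are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Orf:
    start: int      # ORF start coordinate on the contig
    end: int        # ORF end coordinate on the contig
    is_phage: bool  # hypothetical per-ORF label from an upstream classifier


def cluster_phage_orfs(orfs: List[Orf], max_gap: int, min_orfs: int = 2) -> List[Tuple[int, int]]:
    """Group phage-labeled ORFs into candidate regions.

    Two consecutive phage-labeled ORFs join the same region when the gap
    between them is at most `max_gap` (playing the role of D in the caption).
    Regions with fewer than `min_orfs` phage ORFs are discarded.
    """
    regions: List[Tuple[int, int]] = []
    region_start = region_end = None
    count = 0

    for orf in sorted(orfs, key=lambda o: o.start):
        if not orf.is_phage:
            continue  # non-phage ORFs never seed or extend a region
        if count and orf.start - region_end > max_gap:
            # Gap too large: close the current region and start a new one.
            if count >= min_orfs:
                regions.append((region_start, region_end))
            count = 0
        if count == 0:
            region_start, region_end = orf.start, orf.end
        else:
            region_end = max(region_end, orf.end)
        count += 1

    if count >= min_orfs:
        regions.append((region_start, region_end))
    return regions


example_orfs = [
    Orf(100, 400, True), Orf(450, 900, True), Orf(1000, 1500, False),
    Orf(1600, 2000, True), Orf(9000, 9500, True), Orf(9600, 9900, True),
    Orf(20000, 20500, True),
]
print(cluster_phage_orfs(example_orfs, max_gap=1000))
```

In this toy input, a larger `max_gap` merges more distant phage ORFs into one region, while a smaller value fragments or drops them, which is the kind of trade-off that panels d and e explore for different D.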
Fig. 2
The performance of PIDE and other tools in PI detection. a Heatmap illustrating the overlap of PIs predicted by different tools. Each row corresponds to a specific tool (geNomad, VirSorter2, PHASTER, PIDE), and the cells indicate the proportion of PIs predicted by that tool (row) which are also identified by other tools (columns). For example, the value 0.64 in the first row indicates that 64% of the PIs predicted by geNomad are also identified by VirSorter2. Density plots showing the distribution of coverage and identity for 132 overlapped-predicted PIs (b) and 89 unique-predicted PIs (c) aligned with virus entries of nt database. d Gene prediction classification by PIDE for PIDE-Ef PI. e The distribution of virome reads on the local genome of Enterococcus faecalis V583 (blue), with PIDE-Ef PI region highlighted in red. f The boxplot shows the distribution of precision and recall for PIs predicted by different tools, measured at the base level. g The distribution of virome reads on local Fusobacterium varium ATCC 27725 and Bifidobacterium bifidum ATCC 29521 (blue), with PI regions predicted by different tools highlighted in red
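Panel f reports precision and recall at the base level. The exact procedure is not given in the caption; the sketch below shows one common way to compute such metrics, assuming half-open `(start, end)` coordinates on a single genome and a reference set of induced prophage regions treated as ground truth. The function names are hypothetical.

```python
from typing import Iterable, Set, Tuple

Interval = Tuple[int, int]  # half-open: start inclusive, end exclusive


def covered_bases(intervals: Iterable[Interval]) -> Set[int]:
    """Return the set of base positions covered by any interval."""
    bases: Set[int] = set()
    for start, end in intervals:
        bases.update(range(start, end))
    return bases


def base_level_precision_recall(predicted: Iterable[Interval],
                                truth: Iterable[Interval]) -> Tuple[float, float]:
    """Precision = shared bases / predicted bases; recall = shared bases / true bases."""
    pred = covered_bases(predicted)
    true = covered_bases(truth)
    shared = len(pred & true)
    precision = shared / len(pred) if pred else 0.0
    recall = shared / len(true) if true else 0.0
    return precision, recall


# A prediction that overshoots the true region: full recall, reduced precision.
print(base_level_precision_recall([(0, 100)], [(0, 50)]))
```

Measuring at the base level rewards tools for accurate boundaries, not just for overlapping the right locus: a prediction twice as long as the true region still has full recall but only half the precision.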
Fig. 3
The performance of PIDE in identifying assembled phage contigs. a F1 score versus FDR curve at various sequence lengths. b The sequence classification performance of PIDE and VirRep across sequences of varying length. Performance was measured using precision, recall, and F1 score
Fig. 4
The application of PIDE to 4744 representative prokaryotic genomes from the human gut microbiome. The violin plot displays the distribution of the number of PIs on each genome across the top four most abundant phyla (a) or the top twenty most abundant genera (b). c eggNOG annotation of PIs, where the x-axis represents the prevalence in prokaryotic genomes and y-axis represents different eggNOG categories. d The relative abundance of PI-encoded ARGs across different phyla within various categories. e The distribution of the number of PI-encoded genes across different phyla in various metabolic pathways
