Cell Rep. 2022 Jun 21;39(12):110984.
doi: 10.1016/j.celrep.2022.110984.

Thousands of small, novel genes predicted in global phage genomes

Brayon J Fremin et al. Cell Rep. 2022.

Abstract

Small genes (<150 nucleotides) have been systematically overlooked in phage genomes. We employ a large-scale comparative genomics approach to predict >40,000 small-gene families in ∼2.3 million phage genome contigs. We find that small genes in phage genomes are approximately 3-fold more prevalent than in host prokaryotic genomes. Our approach enriches for small genes that are translated in microbiomes, suggesting the small genes identified are coding. More than 9,000 families encode potentially secreted or transmembrane proteins, more than 5,000 families encode predicted anti-CRISPR proteins, and more than 500 families encode predicted antimicrobial proteins. By combining homology and genomic-neighborhood analyses, we reveal substantial novelty and diversity within phage biology, including small phage genes found in multiple host phyla, small genes encoding proteins that play essential roles in host infection, and small genes that share genomic neighborhoods and whose encoded proteins may share related functions.

Keywords: CP: Microbiology; MetaRibo-Seq; comparative genomics; gene families; microbiome; phage; sORFs; small genes.

Conflict of interest statement

Declaration of interests: The authors declare no competing interests.

Figures

Figure 1. Pipeline to identify and characterize small genes in phages
(A) Identifying small genes in phages. A total of 2,377,994 phage contigs were annotated using MetaProdigal, with a lower gene length cutoff of 15 bp. Proteins encoded by these small genes were clustered at 50% amino acid identity using CD-Hit. A comparative-genomics approach using RNAcode was applied to the resulting 633,684 clusters, generating 41,150 small-gene families. (B) Characterizing small genes in phages. Several analyses were performed on these 41,150 small-gene families, including genomic-neighborhood analysis, prediction of anti-CRISPRs, taxonomic classification of both viruses and possible hosts containing these small genes, and prediction of cellular localization of proteins encoded by small genes.
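The length criteria in panel (A) can be illustrated with a minimal sketch. It assumes genes are already available as nucleotide strings keyed by an identifier, and it combines the 15 bp lower cutoff from this caption with the <150 nucleotide definition of a small gene from the abstract. The names `is_small_gene` and `select_small_genes` are hypothetical and are not part of MetaProdigal, CD-Hit, or RNAcode; whether the authors counted the stop codon toward gene length is not stated here, so the sketch simply uses the full string length.

```python
# Illustrative length filter only; not the authors' pipeline code.
MIN_GENE_NT = 15    # lower gene-length cutoff given in the Figure 1 caption
MAX_SMALL_NT = 150  # small genes are defined as < 150 nucleotides in the abstract


def is_small_gene(nt_sequence: str) -> bool:
    """Return True if the sequence length falls in [15, 150) nucleotides."""
    return MIN_GENE_NT <= len(nt_sequence) < MAX_SMALL_NT


def select_small_genes(genes: dict[str, str]) -> dict[str, str]:
    """Keep only the entries whose nucleotide sequence passes the length filter."""
    return {gene_id: seq for gene_id, seq in genes.items() if is_small_gene(seq)}


if __name__ == "__main__":
    # Hypothetical toy records: one small gene, one too long, one too short.
    toy_genes = {
        "gene_small": "ATG" + "AAA" * 10 + "TAA",  # 36 nt
        "gene_long": "A" * 150,                    # exactly 150 nt, excluded
        "gene_tiny": "A" * 12,                     # below the 15 bp cutoff
    }
    print(sorted(select_small_genes(toy_genes)))
```

In the described pipeline, the proteins encoded by the retained genes would then be clustered and evaluated with RNAcode; those steps depend on external tools and are not reproduced in this sketch.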
Figure 2. Summary statistics on the Fremin gp40K
(A) Histogram showing the distribution of protein lengths among families in the Fremin gp40K. (B) Histogram displaying the number of sequences present in each family in the Fremin gp40K. (C) Histogram showing the number of families in which members were assigned ribosome-binding sites. (D) Histogram displaying the number of families found in specific ecosystems.
Figure 3. Comparative genomics enriches for real small genes
(A) Enriching for small genes encoding proteins with known protein domains. The bar plot shows the percentage of small-gene clusters encoding proteins that contain known protein domains, including all phage clusters (n = 633,683), the Fremin gp40K (n = 41,150), all human microbiome clusters from Sberro et al. (2019) (n = 444,054), and the Sberro hm4K (n = 4,539). (B) Enriching for small genes that are translated in human microbiomes. The bar plot shows the percentage of genes with a MetaRibo-Seq signal, including all genes (annotated using default MetaProdigal), small genes homologous to the Fremin gp40K, small genes homologous to the Sberro hm4K, and all small genes. Fisher’s exact test was used to compare between groups (***p < 0.0001).
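For readers unfamiliar with the group comparison in panel (B), the following is a minimal sketch of a two-sided Fisher's exact test on a 2x2 table, written with only the Python standard library. The function name `fisher_exact_two_sided` and all counts in the usage example are hypothetical; the caption does not report the underlying counts, and this is not the authors' analysis code. The two-sided p value is taken, as is conventional, as the sum of the probabilities of all tables with the same margins that are no more likely than the observed one.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p value for the 2x2 table [[a, b], [c, d]].

    Rows could be two gene groups; columns could be "signal" / "no signal".
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)

    def table_prob(x: int) -> float:
        # Hypergeometric probability of seeing x in the top-left cell
        # when all row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = table_prob(a)
    # Sum every table that is as likely as, or less likely than, the observed one.
    # The small tolerance guards against floating-point ties.
    p_value = sum(
        p
        for p in (table_prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


if __name__ == "__main__":
    # Hypothetical counts: 40 of 100 genes in group 1 have a signal,
    # versus 10 of 200 genes in group 2.
    print(fisher_exact_two_sided(40, 60, 10, 190))
```

An exact test of this kind makes no large-sample approximation, which is why it is commonly chosen when some groups are small.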
Figure 4. Comparing the Fremin gp40K with the Sberro hm4K
(A) Overlap of Fremin gp40K and Sberro hm4K datasets. The flowchart displays the use of BLASTp to determine that 3,344 of the 40K families were homologous to 1,961 of the Sberro hm4K families. (B) Families encoding proteins with known protein domains. The histogram shows the number of families in the 40K encoding proteins that were annotated with specific protein domains and which of those were homologous to the Sberro hm4K for the top 30 most commonly assigned protein domains. (C) Taxonomy of the Fremin gp40K. The histogram shows the number of families that were classified at various taxonomic levels and the taxonomic classifications of predicted hosts of families. Families with no taxonomic assignment were classified as “NA.”
Figure 5. Multi-host small-gene families
Homology between multi-host families and the Fremin gp40K. Visualization of homology and ecosystem metadata shared between the multi-host small-gene families and other small-gene families within the Fremin gp40K. We indicate the number of small genes in each family that belong to a specific taxon or ecosystem.
Figure 6. Novel gp51 small-gene families
Cladogram showing 111 proteins encoded by novel small-gene families homologous to family #27. All genes in these families were found near genes encoding other baseplate proteins. Each family contained at least three unique homologs (not shown in the tree).
