Cell. 2019 Aug 22;178(5):1245-1259.e14.
doi: 10.1016/j.cell.2019.07.016. Epub 2019 Aug 8.

Large-Scale Analyses of Human Microbiomes Reveal Thousands of Small, Novel Genes


Hila Sberro et al. Cell.

Abstract

Small proteins are traditionally overlooked due to computational and experimental difficulties in detecting them. To systematically identify small proteins, we carried out a comparative genomics study on 1,773 human-associated metagenomes from four different body sites. We describe >4,000 conserved protein families, the majority of which are novel; ∼30% of these protein families are predicted to be secreted or transmembrane. Over 90% of the small protein families have no known domain and almost half are not represented in reference genomes. We identify putative housekeeping, mammalian-specific, defense-related, and protein families that are likely to be horizontally transferred. We provide evidence of transcription and translation for a subset of these families. Our study suggests that small proteins are highly abundant and those of the human microbiome, in particular, may perform diverse functions that have not been previously reported.

Keywords: annotation; bacteria; bioinformatics; domain; genome; microbe; microbiome; phage; prediction; small open reading frame; small proteins.

Conflict of interest statement

DECLARATION OF INTERESTS

N.G. is an employee and shareholder of One Codex. M.P.S. is a cofounder of Personalis, SensOmics, January, Filtricine, Akna, Qbio; he is on the advisory board of the companies he cofounded, along with Genapsys and Jupiter. A.S.B. is on the advisory board of Caribou Biosciences, January, and ArcBio. The authors declare no other competing financial interests.

Figures

Figure 1. Small Protein Discovery and Characterization Pipeline Applied to HMPI-II Metagenomic Data
(A) Identification of 29 known small proteins in HMPI-II metagenomes. More than 128 million contigs were annotated using MetaProdigal with a lower size limit of five amino acids. The small proteins were then clustered using CD-Hit based on amino acid similarity and protein length. Representatives of each of the ~444,000 clusters were queried against the Conserved Domain Database (CDD) to assign domains to clusters. The list of CDD domains was then queried for the small known proteins that have an assigned domain. Known small proteins that do not have an assigned domain or that failed the domain search were queried against HMPI-II small proteins using BLASTp. (B) Identification and characterization of HMPI-II small proteins. RNAcode was used to assign p values to the ~444,000 clusters. The following analyses were conducted on the ~4,000 protein families whose p value was ≤0.05. (1) Identification of neighboring genes on the longest contig associated with each family. (2) Prediction of secondary structure. (3) Analysis of ribosomal binding sites (RBS) upstream of the small genes. (4) Taxonomic classification of contigs encoding each of the small protein families. (5) Assignment of small protein families to body sites. M, mouth; V, vagina; G, gut; S, skin. (6) Prediction of signal peptide and transmembrane domains to assign likely cellular localization. (7) Analysis of expression of the small genes using metatranscriptomic and metaproteomic datasets as well as Bacteroides thetaiotaomicron transcriptomics and proteomics. (8) Identification of homologs of small protein families in non-human metagenomes. See also Figures S1, S2, and S7, Tables S1, S2, S3, and S4, and Data S1 and S2.
Figure 2. Many of the ~4,000 Families, Some of which Are Very Abundant, Are Not Assigned a Known Protein Domain nor Are They Represented in RefSeq Genomes
(A) Pipeline to identify families that do not have an assigned domain and families that are not represented in RefSeq genomes. Upper path of the flow diagram: only a small subset of the ~4,000 small protein families were assigned a protein domain (identified by RPS-blast against CDD position specific scoring matrices, PSSMs). Lower path of the flow diagram: representatives of all ~4,000 families were blasted against ~3,000,000 small RefSeq annotated proteins originating from ~70,000 RefSeq genomes and against ~7,000,000 putative small proteins that we annotated using Prodigal with adjusted thresholds. The second step allowed the identification of an additional set of homologs that are encoded but not annotated in RefSeq genomes. (B) Domains identified among ~4,000 families. Domains that were assigned to ≥5 families and/or ≥50 species are shown. A complete list of domains can be found in Table S3. (C) Histogram of the number of species encoding small proteins of families with no known domain.
Figure 3. A Subset of Small Protein Families Is Prevalent across the Tree of Life
(A) Most abundant families. Each row represents one of the 14 families that were identified in ≥100 species. The taxonomic distribution of the 14 families is presented in the blue table, the prevalence among body sites is presented in the green table, and the number of homologs identified in non-human metagenomes is presented in the brown table. The potential novel ribosomal protein is family 26. When multiple homologs were mapped to the same taxon, they are counted as one event in this table. SCIFF, “six cysteines in forty-five residues.” (B) The fraction of families assigned to different numbers of phyla for the 14 potential housekeeping families (red) and the 4,525 remaining families (blue) is shown. For example, >50% of the non-housekeeping families were assigned to one phylum, whereas none of the housekeeping families were assigned to only one phylum. (C and D) Potential novel ribosomal protein. (C) Phylogenetic tree of family 26. (D) The genomic neighborhood of DUF4295 (family 26) next to two known ribosomal proteins is illustrated. In Bacteroides thetaiotaomicron VPI-5482 it is encoded in the intergenic region downstream of these genes (locus tags BT0914 and BT0915). (E) Homology between family 26 and family 7858, two potential novel ribosome-associated families of proteins. Family 7858 is encoded by 26 species from 3 different phyla and did not pass the required ‘housekeeping’ threshold (which requires ≥100 species). The family 7858 gene is genomically positioned next to two ribosomal proteins; it is found in 85% of mouth samples (but not in any gut samples) as well as in diverse non-human environments. See also Figures S3 and S4 and Tables S1 and S3.
Figure 4. Small Proteins that Are Potentially Involved in Cross-Talk
(A–C) Family 350024 is an abundant gut-related predicted transmembrane family potentially involved in bacteria-host or bacteria-bacteria crosstalk. (A) Multiple sequence alignment of representatives of all families that share amino acid sequence homology with family 350024. The length of the protein sequence is indicated after each family ID. (B) Phylogenetic spread of family 350024 and 22 other homologous families. (C) Genomic neighborhood, next to a DNA binding protein and an N-acetylmuramoyl-L-alanine amidase, an enzyme that cleaves the amide bond between N-acetylmuramoyl and L-amino acids in bacterial cell walls. The locus tag of the small predicted transmembrane protein (red) is Ga0104402_10435 (Bacteroides ovatus NLAE-zl-C500). (D) Putative signaling molecule that is presumably subject to horizontal transfer. Schematic representation of genes encoded on contigs of family 155173. In addition to Agr genes, these contigs typically harbor genes that are associated with horizontal transfer. See also Figure S5 and Tables S3 and S5.
Figure 5. Small Proteins that Are Potentially Associated with Defense against Phage
(A and B) Small protein family (395508) possibly associated with a CRISPR anti-phage system. (A) Genomic neighborhood of small protein (red arrow) across 6 different species. Homologs of this small protein are shown in the genomic locus in which they were found among a variety of Veillonella species within HMPI-II data. (B) Multiple sequence alignment of homologs of the family demonstrates a high level of conservation within small protein family 395508. (C) Small protein of family 588 is encoded upstream of a known toxin.
Figure 6. Small Proteins that Are Potentially Subject to HGT between Phyla
(A) Each dot represents one of 202 families that were identified in the screen of HGT genes in the vicinity of a small gene and whose median percentage of k-mers that were classified is >10%. Families that are encoded by a small number of species across a larger number of phyla/class/order are more likely to be true positives. (B) Of the 100 families presented in (A), 57 small protein families that were identified in ≥2 phyla are presented. Only phyla that were identified in at least five different small gene families are shown. Numbers within boxes indicate the total number of individual homologs within the family encoded by the designated phylum. Each row was normalized. See also Figure S6 and Table S3.
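The two numeric cutoffs stated in the captions above (families kept when the RNAcode p value is ≤0.05, and the 'housekeeping' threshold of ≥100 species) can be illustrated with a minimal Python sketch. This is not the authors' code: the `Family` record, its field names, the function names, and all toy values (p values, species labels, phylum labels) are hypothetical and exist only for illustration. Only the two cutoffs and the fact that family 7858 is encoded by 26 species from 3 phyla come from the captions.

```python
from dataclasses import dataclass

P_VALUE_CUTOFF = 0.05           # Figure 1B: families with p value <= 0.05 are analyzed
HOUSEKEEPING_MIN_SPECIES = 100  # Figure 3: 'housekeeping' requires >= 100 species


@dataclass(frozen=True)
class Family:
    """Hypothetical record for one small protein family."""
    family_id: str
    p_value: float       # assumed to hold the RNAcode p value of the cluster
    species: frozenset   # species encoding at least one homolog
    phyla: frozenset     # phyla those species belong to


def conserved(families):
    """Keep families whose p value passes the cutoff."""
    return [f for f in families if f.p_value <= P_VALUE_CUTOFF]


def is_housekeeping_candidate(family):
    """Apply the species-count threshold quoted in the Figure 3 caption."""
    return len(family.species) >= HOUSEKEEPING_MIN_SPECIES


# Toy data: p values and labels are invented placeholders.
toy_families = [
    Family("26", 0.001,
           frozenset(f"species_{i}" for i in range(120)),
           frozenset({"phylum_A", "phylum_B", "phylum_C", "phylum_D"})),
    Family("7858", 0.01,
           frozenset(f"species_{i}" for i in range(26)),   # 26 species, per caption
           frozenset({"phylum_A", "phylum_B", "phylum_C"})),  # 3 phyla, per caption
    Family("toy_rejected", 0.2,
           frozenset({"species_0"}),
           frozenset({"phylum_A"})),
]

kept = conserved(toy_families)
housekeeping_ids = [f.family_id for f in kept if is_housekeeping_candidate(f)]
```

With this toy input, family 7858 passes the p value filter but falls short of the species threshold, which mirrors the statement in the Figure 3 caption that it did not qualify as 'housekeeping'.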
