Elife. 2022 Mar 31:11:e67667.

doi: 10.7554/eLife.67667.

Unifying the known and unknown microbial coding sequence space

Chiara Vanni^{1,2}, Matthew S Schechter^{1,3}, Silvia G Acinas⁴, Albert Barberán⁵, Pier Luigi Buttigieg⁶, Emilio O Casamayor⁷, Tom O Delmont⁸, Carlos M Duarte⁹, A Murat Eren^{3,10}, Robert D Finn¹¹, Renzo Kottmann¹, Alex Mitchell¹¹, Pablo Sánchez⁴, Kimmo Siren¹², Martin Steinegger^{13,14}, Frank Oliver Gloeckner^{2,15,16}, Antonio Fernàndez-Guerra^{1,17}

Affiliations

¹ Microbial Genomics and Bioinformatics Research G, Max Planck Institute for Marine Microbiology, Bremen, Germany.
² Jacobs University Bremen, Bremen, Germany.
³ Department of Medicine, University of Chicago, Chicago, United States.
⁴ Department of Marine Biology and Oceanography, Institut de Ciències del Mar (CSIC), Barcelona, Spain.
⁵ Department of Environmental Science, University of Arizona, Tucson, United States.
⁶ Alfred Wegener Institute, Helmholtz Centre for Polar and Marine Research, Alfred Wegener Institute, Bremerhaven, Germany.
⁷ Center for Advanced Studies of Blanes CEAB-CSIC, Spanish Council for Research, Blanes, Spain.
⁸ Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ Evry, Université Paris-Saclay, Evry, France.
⁹ Red Sea Research Centre and Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia.
¹⁰ Josephine Bay Paul Center, Marine Biological Laboratory, Woods Hole, United States.
¹¹ European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, United Kingdom.
¹² Section for Evolutionary Genomics, The GLOBE Institute, University of Copenhagen, Copenhagen, Denmark.
¹³ School of Biological Sciences, Seoul National University, Seoul, Republic of Korea.
¹⁴ Institute of Molecular Biology and Genetics, Seoul National University, Seoul, Republic of Korea.
¹⁵ University of Bremen and Life Sciences and Chemistry, Bremen, Germany.
¹⁶ Computing Center, Helmholtz Center for Polar and Marine Research, Bremerhaven, Germany.
¹⁷ Lundbeck Foundation GeoGenetics Centre, GLOBE Institute, University of Copenhagen, Copenhagen, Denmark.

PMID: 35356891
PMCID: PMC9132574
DOI: 10.7554/eLife.67667

Chiara Vanni et al. Elife. 2022.

Abstract

Genes of unknown function are among the biggest challenges in molecular biology, especially in microbial systems, where 40-60% of the predicted genes are unknown. Despite previous attempts, systematic approaches to include the unknown fraction in analytical workflows are still lacking. Here, we present a conceptual framework, its translation into the computational workflow AGNOSTOS, and a demonstration of how we can bridge the known-unknown gap in genomes and metagenomes. By analyzing 415,971,742 genes predicted from 1749 metagenomes and 28,941 bacterial and archaeal genomes, we quantify the extent of the unknown fraction, its diversity, and its relevance across multiple organisms and environments. The unknown sequence space is exceptionally diverse, phylogenetically more conserved than the known fraction and predominantly taxonomically restricted at the species level. From the 71 M genes identified to be of unknown function, we compiled a collection of 283,874 lineage-specific genes of unknown function for Cand. Patescibacteria (also known as Candidate Phyla Radiation, CPR), which provides a significant resource to expand our understanding of their unusual biology. Finally, by identifying a target gene of unknown function for antibiotic resistance, we demonstrate how we can enable the generation of hypotheses that can be used to augment experimental data.

Keywords: bioinformatics; computational biology; functional metagenomics; gene clusters; infectious disease; microbial genomics; microbiology; phylogenomics; systems biology; unknown function.

Plain language summary

It is estimated that scientists do not know what half of microbial genes actually do. When these genes are discovered in microorganisms grown in the lab or found in environmental samples, it is not possible to identify what their roles are. Many of these genes are excluded from further analyses for these reasons, meaning that the study of microbial genes tends to be limited to genes that have already been described. These limitations hinder research into microbiology, because information from newly discovered genes cannot be integrated to better understand how these organisms work. Experiments to understand what role these genes have in the microorganisms are labor-intensive, so new analytical strategies are needed. To do this, Vanni et al. developed a new framework to categorize genes with unknown roles, and a computational workflow to integrate them into traditional analyses. When this approach was applied to over 400 million microbial genes (both with known and unknown roles), it showed that the share of genes with unknown functions is only about 30 per cent, smaller than previously thought. The analysis also showed that these genes are very diverse, revealing a huge space for future research and potential applications. Combining their approach with experimental data, Vanni et al. were able to identify a gene with a previously unknown purpose that could be involved in antibiotic resistance. This system could be useful for other scientists studying microorganisms to get a more complete view of microbial systems. In future, it may also be used to analyze the genetics of other organisms, such as plants and animals.

Conflict of interest statement

CV, MS, SA, AB, PB, EC, TD, CD, AE, RF, RK, AM, PS, KS, MS, FG, AF: No competing interests declared.

Figures

**Figure 1. Conceptual framework to unify the known and unknown sequence space and integration of the framework in the current analytical workflows.**
(A) Link between the conceptual framework and the computational workflow to partition the sequence space into the four conceptual categories. AGNOSTOS infers, validates and refines the gene clusters (GCs) and combines them into gene cluster communities (GCCs). Then, it classifies them into one of the four conceptual categories based on their level of ‘darkness’. Finally, we add context to each GC based on several sources of information, providing a robust framework for generating hypotheses that can be used to augment experimental data. (B) The computational workflow provides two mechanisms to structure sequence space using GCs: de novo creation of the GCs (*DB creation*), or integration of the dataset into an existing GC database (*DB update*). The structured sequence space can then be plugged into traditional analytical workflows to annotate the genes within each GC of the known fraction. With AGNOSTOS, we provide the opportunity to integrate the unknown fraction into microbiome analyses easily. (C) The versatility of the GCs enables analyses at different scales depending on the scope of our experiments. We can group GCs into gene cluster communities based on their shared homologies to perform coarse-grained analyses. On the other hand, we can design fine-grained analyses using the relationships between the genes in a GC, that is, detecting network modules in the GC inner sequence similarity network. Additionally, because GCs are conserved across environments, organisms and experimental conditions, they give us access to an unprecedented amount of information to design experiments and interpret experimental data.

**Figure 2. Overview and validation of the workflow to aggregate GCs in communities.**
(A) We inferred a gene cluster homology network using the results of an all-vs-all HMM gene cluster comparison with HHblits. The edges of the network are weighted by the HHblits-score/Aligned-columns metric. Communities are identified by iteratively screening different MCL inflation parameters and evaluated using five metrics that consider the inter- and intra-community properties. (B) Comparison of the number of GCs and GCCs for each of the functional categories. (C) Validation of the GCC inference based on the environmental genes annotated as proteorhodopsins. Ribbons in the alluvial plot are genes, and each stacked bar corresponds (from left to right) to the (1) gene taxonomic classification at the domain level, (2) GC membership, (3) GCC membership and (4) MicRhoDE operational classification. (D) Validation of the GCC inference based on ribosomal proteins, using standard and high-quality GCs.
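
The edge weighting described in panel (A) can be sketched as follows. This is only an illustration under stated assumptions: the `HmmHit` record and its field names are hypothetical, not the HHblits output format, and keeping the larger weight for reciprocal hits is an assumption, not a rule taken from the paper. Community detection with MCL is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HmmHit:
    """One HMM-HMM hit between two gene clusters (hypothetical record layout)."""
    query_gc: str
    target_gc: str
    score: float        # alignment score reported for the hit
    aligned_cols: int   # number of aligned columns


def edge_weight(hit: HmmHit) -> float:
    """Score normalized by alignment length (score / aligned columns)."""
    if hit.aligned_cols <= 0:
        raise ValueError("aligned_cols must be positive")
    return hit.score / hit.aligned_cols


def build_network(hits):
    """Collapse hits into undirected weighted edges.

    Self-hits are skipped. For reciprocal hits (A->B and B->A) the larger
    weight is kept; that choice is an assumption made for this sketch.
    """
    edges = {}
    for hit in hits:
        if hit.query_gc == hit.target_gc:
            continue
        key = tuple(sorted((hit.query_gc, hit.target_gc)))
        edges[key] = max(edge_weight(hit), edges.get(key, 0.0))
    return edges


hits = [
    HmmHit("GC_A", "GC_B", score=120.0, aligned_cols=100),
    HmmHit("GC_B", "GC_A", score=90.0, aligned_cols=100),
    HmmHit("GC_A", "GC_A", score=300.0, aligned_cols=100),
]
network = build_network(hits)
```

Normalizing by the number of aligned columns keeps long alignments from dominating the network simply because they accumulate more score.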

**Figure 3. The extent of the known and unknown sequence space.**
(A) Proportion of genes in the known and unknown sequence space. (B) Accumulation curves for the known and unknown sequence space at the GC level for the metagenomic and genomic data. Metagenomes are from the TARA, MALASPINA, OSD2014 and HMP-I/II projects. (C) Collector curves comparing the human and marine biomes. Colored lines represent the mean of 1000 permutations, and gray shading shows the standard deviation. Non-abundant singleton clusters were excluded from the accumulation curve calculations. (D) Amino acid distribution in the known and unknown sequence space. In all cases, the four categories have been simplified as known (K, KWP) and unknown (GU, EU).
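
A permutation-based collector curve of the kind summarized in panels (B) and (C) can be sketched as below. The function name and the toy data are hypothetical, and the exclusion of non-abundant singleton clusters mentioned in the caption is not implemented; this is a minimal sketch, not the authors' procedure.

```python
import random
import statistics


def collector_curve(samples, n_perm=1000, seed=0):
    """Mean and standard deviation of cumulative distinct gene clusters.

    samples: list of sets, each holding the gene cluster IDs seen in one
             metagenome or genome.
    Returns (means, sds); position i describes the first i + 1 samples
    taken in a random order, summarized over n_perm permutations.
    """
    rng = random.Random(seed)
    order = list(range(len(samples)))
    counts = [[] for _ in samples]
    for _ in range(n_perm):
        rng.shuffle(order)
        seen = set()
        for step, idx in enumerate(order):
            seen |= samples[idx]
            counts[step].append(len(seen))
    means = [statistics.mean(c) for c in counts]
    sds = [statistics.pstdev(c) for c in counts]
    return means, sds


toy_samples = [{"gc1", "gc2"}, {"gc2", "gc3"}, {"gc4"}]
means, sds = collector_curve(toy_samples)
```

A curve that keeps rising as samples are added indicates that new gene clusters are still being discovered; a plateau suggests the sampled space is close to saturation.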

**Figure 4. Distribution of the unknown sequence space in the human and marine metagenomes.**
(A) Ratio between the proportion of the number of genes and their estimated abundances per cluster category and biome. The facet columns depict three cluster categories based on the size of the clusters. (B) Relationship between the ratio of Genomic unknowns and Environmental unknowns in the HMP-I/II metagenomes. Gastrointestinal tract metagenomes are enriched in Genomic unknown sequences compared to the other body sites. (C) Relationship between the ratio of Genomic unknowns and Environmental unknowns in the TARA Oceans metagenomes. Girus- and virus-enriched metagenomes show a higher proportion of both types of unknown sequences (genomic and environmental) than the Archaea|Bacteria-enriched fractions. (D) Environmental distribution of GCs and GCCs based on Levins' niche breadth index. We obtained the significance values after generating 100 null gene cluster abundance matrices using the *quasiswap* algorithm.
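
The niche breadth index named in panel (D) is commonly written as B = 1 / Σ p_j², where p_j is the share of a gene cluster's abundance found in sample j. The sketch below uses that textbook form; whether the authors used this form or a standardized variant is not stated here, and the *quasiswap* null model used for the significance values is not reproduced. Function names are hypothetical.

```python
def niche_breadth(abundances):
    """Niche breadth B = 1 / sum(p_j ** 2) for one gene cluster.

    abundances: non-negative abundances of the cluster across samples.
    B ranges from 1 (found in a single sample) to the number of samples
    (evenly spread across all of them).
    """
    total = sum(abundances)
    if total <= 0:
        raise ValueError("cluster has no abundance in any sample")
    proportions = [a / total for a in abundances]
    return 1.0 / sum(p * p for p in proportions)


def standardized_breadth(abundances):
    """Rescale B to the 0-1 range: (B - 1) / (n - 1)."""
    n = len(abundances)
    if n < 2:
        raise ValueError("need at least two samples")
    return (niche_breadth(abundances) - 1.0) / (n - 1)


broad = niche_breadth([5, 5, 5, 5])    # evenly distributed cluster
narrow = niche_breadth([10, 0, 0, 0])  # cluster restricted to one sample
```

Comparing the observed index with the values obtained from randomized abundance matrices is what allows a cluster to be called significantly narrow or broad rather than merely rare.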

**Figure 5. Phylogenomic exploration of the unknown sequence space.**
(A) Distribution of the lineage-specific GCs by taxonomic level. Lineage-specific unknown GCs are more abundant in the lower taxonomic levels (genus, species). (B) Phylogenetic conservation of the known and unknown sequence space in 27,372 bacterial genomes from GTDB_r86. We observe differences in the conservation between the known and the unknown sequence space for lineage- and non-lineage-specific GCs (paired Wilcoxon rank-sum test; all p-values < 0.0001). (C) The majority of the lineage-specific clusters are part of the unknown sequence space, and only a small proportion was found in prophages present in the GTDB_r86 genomes. (D) Known and unknown sequence space of the 27,372 GTDB_r86 bacterial genomes grouped by bacterial phyla. Phyla are partitioned based on the ratio of known to unknown GCs and vice versa. Phyla enriched in MAGs have higher proportions of GCs of unknown function. Phyla with a high proportion of non-classified clusters (NC; discarded during the validation steps) tend to contain a small number of genomes. (E) The left side of the alluvial plot shows the uncharacterized (OM-RGC v2 GC) and characterized (OM-RGC v2) fractions of the gene catalog. The functional annotation is based on the eggNOG annotations provided by Salazar et al., 2019. The right side of the alluvial plot shows the new organization of the OM-RGC v2 sequence space based on the approach described in this study. The treemap on the right links the metagenomic and genomic space, adding context to the unknown fraction of the OM-RGC v2.

**Figure 6. Augmenting experimental data with GCs of unknown function.**
(A) We used the fitness values from the experiments of Price et al., 2018 to identify genes of unknown function that are important for fitness under certain experimental conditions. The selected gene belongs to the genomic unknown GC GU_19737823 and presents a strong phenotype (fitness = –3.1; t = –9.1). (B) Occurrence of GU_19737823 in the metagenomes used in this study. Darker bars depict the number of metagenomes where the GC is found. (C) GU_19737823 is a member of the GCC GU_c_21103. The network shows the relationships between the different GCs that are members of the gene cluster community GU_c_21103. The size of the node corresponds to the node degree of each GC. Edge thickness corresponds to the bitscore/column metric. Highlighted in red is GU_19737823. (D) We identified all the genes in the GTDB_r86 genomes that belong to the GCC GU_c_21103 and explored their genomic neighborhoods. GU_c_21103 members were constrained to the class *Gammaproteobacteria*, and GU_19737823 is mostly exclusive to the order *Pseudomonadales*. The gene order in the different genomes analyzed is highly conserved, with GU_19737823 found after the *rpsF::rpsR* operon and before *rplI*. *rpsF* and *rpsR* encode the *30S ribosomal protein S6* and *30S ribosomal protein S18*, respectively. The GTDB_r86 subtree only shows RefSeq genomes. Branch colors correspond to the different GCs found in GU_c_21103. The bubble plot depicts the number of genomes with a gene that belongs to GU_c_21103.

**Appendix 1—figure 1. Overview of the workflow to partition the genomic and metagenomic sequence space between known and unknown.**
The workflow performs gene prediction, gene clustering, gene clustering validation and refinement, GCC inference, and partitions the sequence space in the different known and unknown categories.

**Appendix 1—figure 2. The diagram shows a schematic description of the number of genes and GCs that have been kept or discarded.**
(A) We analyzed a dataset of 1749 metagenomes from marine and human environments and 28,941 genomes from the GTDB_r86 summing up to 415,971,742 genes. The composition of the genomic box ‘Other’ is described in Appendix Note 5. (B) GC overlap between the environmental and genomic datasets.

**Appendix 1—figure 3. Proportion of complete genes per cluster.**
Distribution of observed values compared with those generated by the Broken-stick model. The cut-off was determined at 34% complete genes per cluster.
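
For reference, the expected proportions under a broken-stick model can be computed as in the sketch below. It shows only the standard expectation for a stick broken at random into n pieces, b_k = (1/n) Σ_{i=k..n} 1/i for the k-th largest piece; how the authors derived the 34% cut-off from the comparison is not described in this caption, so the comparison helper is a hypothetical illustration.

```python
def broken_stick(n):
    """Expected proportions of the n pieces of a randomly broken stick.

    Returns a list sorted from the largest expected piece to the smallest,
    with b_k = (1 / n) * sum(1 / i for i in k..n). The values sum to 1.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    expected = []
    tail = 0.0
    # Accumulate the harmonic tail from i = n down to i = 1.
    for i in range(n, 0, -1):
        tail += 1.0 / i
        expected.append(tail / n)
    expected.reverse()
    return expected


def above_expectation(observed):
    """Flag ranked observed proportions that exceed the model expectation.

    observed: proportions sorted in decreasing order (hypothetical input).
    """
    expected = broken_stick(len(observed))
    return [obs > exp for obs, exp in zip(observed, expected)]


two_pieces = broken_stick(2)
flags = above_expectation([0.8, 0.2])
```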

**Appendix 1—figure 4. Collector curves for the known and unknown sequence space.**
(A) Collector curves at the gene cluster level, for the TARA metagenomes, including the viral fraction (left) and excluding it (right) from the analysis. (B) Collector curves at gene cluster community level for the metagenomes from TARA, MALASPINA, and HMP-I/II projects (left) and the 28,941 GTDB genomes (right).

**Appendix 1—figure 5. Collector curves for the known and unknown sequence space at the gene cluster level for (A) the metagenomes from TARA, MALASPINA and HMP-I/II projects, and for (B) the 28,941 GTDB genomes.**
Singletons were excluded from the calculations.

**Appendix 1—figure 6. Proportion of gene cluster categories per biome.**
The y-axis shows the 11 main biome categories indicated by MGnify, with the total number of genes in each biome in parentheses. The gray fraction represents the pool of genes from MGnify that were not found in our dataset.

**Appendix 1—figure 7. HMP outlier samples enriched in (A) crAssphages, and (B) papillomaviruses (HPV).**

**Appendix 1—figure 8. EggNOG annotation entropy within the GCs (A) and the GCCs (B).**
The entropy was calculated using the function *entropy.empirical()* from the R package ‘entropy’, which estimates the Shannon entropy from the empirical frequencies of the observed values.
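
The quantity being estimated is the Shannon entropy of the annotation labels observed within a cluster. The sketch below is a Python stand-in for the calculation, not the R function itself; it uses natural logarithms, and the function name and example labels are hypothetical.

```python
import math
from collections import Counter


def empirical_entropy(labels):
    """Shannon entropy (in nats) from the empirical label frequencies.

    labels: annotation labels of the genes in one cluster.
    Returns 0.0 when every gene carries the same annotation.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no labels supplied")
    return -sum((c / total) * math.log(c / total) for c in counts.values())


homogeneous = empirical_entropy(["COG0001", "COG0001", "COG0001"])
mixed = empirical_entropy(["COG0001", "COG0002"])
```

An entropy of zero means all annotated genes in the cluster agree, which is the behavior expected of a functionally coherent cluster.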

**Appendix 3—figure 1. Proportion of outlier genes detected within each cluster MSA.**
Distribution of observed values compared with those generated by the Broken-stick model. The cut-off was determined at 10% outlier genes per cluster.

**Appendix 5—figure 1. Proportion of outlier genomic genes identified within each cluster MSA.**
Distribution of observed values compared with those of the Broken-stick model.

**Appendix 5—figure 2. Comparison of the clustering results obtained with the one-step and two-step approach in terms of cluster composition.**

**Appendix 7—figure 1. Radar plots used to determine the best MCL inflation value for the partitioning of the K into cluster components.**
The plots were built using a combination of five variables: 1 = proportion of clusters with one component, 2 = proportion of clusters with more than one member, 3 = clan entropy (proportion of clusters with entropy = 0), 4 = intra HHblits-Score/Aligned-columns (normalized by the maximum value), and 5 = number of clusters (related to the non-redundant set of DAs). (A) Metagenomic dataset. (B) Genomic dataset.

**Appendix 7—figure 2. Cluster pairs distribution based on the metrics used to weight the gene cluster HMM-HMM homology network.**
(A) HHblits-Score/Aligned-columns (Vanni et al., 2021). (B) maximum(HHblits-probability × coverage) (Méheust et al.).

**Appendix 7—figure 3. Determination of the edge-weight metrics for the GC HMM-HMM homology network.**
We tested the metrics used in Méheust et al. and this paper (Vanni et al.). The correlations between metrics are shown per functional category. The metric used by Méheust et al. corresponds to the maximum(HHblits-probability × coverage). The metric applied in this manuscript is *HHblits-Score/Aligned-columns*. (A) Comparison between the metric of Méheust et al. and the HHblits-Probability. (B) Comparison between the metric used in this manuscript and the HHblits-Probability. (C) Comparison between the metric used in this manuscript and the metric of Méheust et al.

**Appendix 7—figure 4. Agreement between the number of communities within ribosomal protein families between our approach and the one described in Méheust et al.**

**Appendix 9—figure 1. Coverage of external datasets.**
The bar plot shows the proportion of covered genes in each of the seven datasets that were screened against the HMM profiles of the metagenomic set of clusters.

**Appendix 10—figure 1. Results of mapping broadly distributed EUs onto TARA MAGs.**
(A) Histogram of TARA MAG percent completeness (checkM). The red line represents the number of EU found in the MAGs. (B) Contigs from TARA MAG TARA_ANW_MAG_00076 in descending order of highest proportion of non-hypothetical gene content. (C) EU communities in the context of a MAG contig. Contig genomic neighborhood around two potential EU communities.

**Appendix 11—figure 1. Phylogenomic exploration of the unknown sequence space in Archaea.**
(A) Distribution of the lineage-specific gene clusters by taxonomic level. Lineage-specific unknown gene clusters are more abundant at the lower taxonomic levels (genus, species). (B) Phylogenetic conservation of the known and unknown sequence space in 1,569 archaeal genomes from GTDB. We calculated the mean trait depth (τD) with the consenTRAIT algorithm and the lineage specificity using the F1-score approach from Mendler et al., 2019. We observe differences in the conservation between the known and the unknown sequence space for lineage- and non-lineage-specific gene clusters (paired Wilcoxon rank-sum test; all p-values < 0.0001). (C) The majority of the lineage-specific clusters are part of the unknown sequence space, and only a small proportion was found in prophages present in the GTDB genomes. (D) Known and unknown sequence space of the 1,569 GTDB archaeal genomes grouped by archaeal phyla. Phyla are partitioned based on the ratio of known to unknown gene clusters and vice versa from the set of genomes. Phyla enriched in metagenome-assembled genomes (MAGs) have a higher proportion of gene clusters of unknown function.
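
An F1-score for lineage specificity can be sketched as below. The definitions of precision and recall used here are assumptions made for illustration (precision: share of the genomes carrying the gene cluster that belong to the lineage; recall: share of the lineage's genomes that carry it); the caption does not spell them out, and the function name is hypothetical. The consenTRAIT calculation is not shown.

```python
def lineage_f1(genomes_with_gc, lineage_genomes):
    """F1-score of a gene cluster for one lineage.

    genomes_with_gc: set of genome IDs in which the gene cluster occurs.
    lineage_genomes: set of genome IDs assigned to the lineage.
    Precision and recall are defined as assumed in the lead-in text.
    """
    shared = len(genomes_with_gc & lineage_genomes)
    if shared == 0:
        return 0.0
    precision = shared / len(genomes_with_gc)
    recall = shared / len(lineage_genomes)
    return 2 * precision * recall / (precision + recall)


perfect = lineage_f1({"g1", "g2", "g3"}, {"g1", "g2", "g3"})
partial = lineage_f1({"g1", "g2", "g4"}, {"g1", "g2", "g3", "g5"})
```

A score of 1 means the cluster occurs in every genome of the lineage and nowhere else; lower scores reflect either missing members or occurrences outside the lineage.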

**Appendix 12—figure 1. *Cand.* Patescibacteria metagenomic lineage-specific clusters.**
(A) Phylogenetic tree of *Cand.* Patescibacteria genera, colored by classes. The heatmaps around the tree show the proportion of lineage-specific gene clusters of knowns and unknowns in the metagenomes from TARA, MALASPINA and the HMP. (B) Metagenomic lineage-specific clusters in the class *Gracilibacteria*.
