Nat Methods. 2025 Oct;22(10):2065-2073.
doi: 10.1038/s41592-025-02816-x. Epub 2025 Sep 15.

De novo discovery of conserved gene clusters in microbial genomes with Spacedust

Ruoshi Zhang et al. Nat Methods. 2025 Oct.

Abstract

Metagenomics has revolutionized environmental and human-associated microbiome studies. However, the limited fraction of proteins with known biological processes and molecular functions presents a major bottleneck. In prokaryotes and viruses, evolution favors keeping genes participating in the same biological processes colocalized as conserved gene clusters. Conversely, conservation of gene neighborhood indicates functional association. Here we present Spacedust, a tool for systematic, de novo discovery of conserved gene clusters. To find homologous protein matches, Spacedust uses fast and sensitive structure comparison with Foldseek. Partially conserved clusters are detected using novel clustering and order conservation P values. We demonstrate Spacedust's sensitivity with an all-versus-all analysis of 1,308 bacterial genomes, identifying 72,843 conserved gene clusters containing 58% of the 4.2 million genes. It recovered 95% of antiviral defense system clusters annotated by the specialized tool PADLOC. Spacedust's high sensitivity and speed will facilitate the annotation of large numbers of sequenced bacterial, archaeal and viral genomes.

Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

Fig. 1. Spacedust algorithm.
a, Workflow. b, Starting from single-gene clusters, clusters are iteratively merged if this increases their conservation score until all scores are maximal. Colored boxes indicate pairs of homologous proteins found by Foldseek. Cluster conservation is measured by combining a clustering P value and an ordering P value.
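The iterative merging in panel b can be pictured with a small greedy agglomeration sketch. This is only an illustration under stated assumptions: `toy_score` and its `gap_penalty` are hypothetical stand-ins, not Spacedust's conservation score, which the caption says combines a clustering P value and an ordering P value. The function `greedy_merge` is likewise a name invented here.

```python
from typing import Callable, List

Cluster = List[int]  # sorted gene positions of matched hits on one genome


def toy_score(cluster: Cluster, gap_penalty: float = 0.5) -> float:
    """Hypothetical score: reward matched genes, penalize unmatched gaps.

    NOT the Spacedust statistic; it only gives the merge loop something
    to maximize.
    """
    span = cluster[-1] - cluster[0] + 1
    gaps = span - len(cluster)
    return len(cluster) - gap_penalty * gaps


def greedy_merge(positions: List[int],
                 score: Callable[[Cluster], float] = toy_score) -> List[Cluster]:
    """Start from single-gene clusters and merge neighbors while scores rise."""
    clusters: List[Cluster] = [[p] for p in sorted(set(positions))]
    while len(clusters) > 1:
        best_gain, best_i = 0.0, None
        for i in range(len(clusters) - 1):
            merged = clusters[i] + clusters[i + 1]
            # Assumption: a merge is accepted only if the merged cluster
            # scores higher than either of its two parts.
            gain = score(merged) - max(score(clusters[i]), score(clusters[i + 1]))
            if gain > best_gain:
                best_gain, best_i = gain, i
        if best_i is None:
            break  # no merge improves any score: all scores are maximal
        clusters[best_i:best_i + 2] = [clusters[best_i] + clusters[best_i + 1]]
    return clusters


if __name__ == "__main__":
    print(greedy_merge([10, 11, 13, 40, 41]))
```

Swapping in a different `score` callable changes which merges are accepted without touching the loop, which is why the score is passed as a parameter.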
Fig. 2. Conservation of gene clusters identified by Spacedust predicts functional association.
a, Distribution of cluster sizes of all 106.6 million pairwise cluster matches among 1,308 bacterial genomes. b, Number of all (green), annotated (blue) and unannotated (orange) genes forming part of a cluster match in at least the number of genomes shown on the x axis. c,e, Precision of the functional association of gene pairs, separated by up to four genes in Spacedust cluster matches, versus the number of genomes in which the pair is conserved. True positive predictions are those gene pairs with the same KEGG module IDs. c, Foldseek+MMseqs search. e, Foldseek-only search with ProstT5. d,f, Precision versus recall of functional association of gene pairs separated by up to four genes. The analysis excludes ribosomal genes; see Extended Data Fig. 1 for analysis with ribosomal genes. d, Foldseek+MMseqs search. f, Foldseek-only search with ProstT5. AUC, area under the curve.
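The precision measure in panels c and e can be sketched as follows. This is a minimal illustration, not the authors' evaluation code. Two choices are assumptions made here: pairs in which either gene lacks a KEGG module annotation are skipped, and "same KEGG module IDs" is read as sharing at least one module. The names `pairs_within` and `association_precision` are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple


def pairs_within(genes: List[str], max_offset: int = 4) -> List[Tuple[str, str]]:
    """Gene pairs (i, i+1), ..., (i, i+max_offset) along an ordered gene list."""
    return [(genes[i], genes[j])
            for i in range(len(genes))
            for j in range(i + 1, min(i + max_offset, len(genes) - 1) + 1)]


def association_precision(pairs: List[Tuple[str, str]],
                          modules: Dict[str, Set[str]]) -> Optional[float]:
    """Fraction of annotated pairs whose genes share a KEGG module ID."""
    tp = fp = 0
    for a, b in pairs:
        ma, mb = modules.get(a), modules.get(b)
        if not ma or not mb:
            continue  # assumption: unannotated pairs are not counted
        if ma & mb:
            tp += 1   # true positive: at least one shared module ID
        else:
            fp += 1
    return tp / (tp + fp) if (tp + fp) else None


if __name__ == "__main__":
    # Illustrative annotations only; not taken from the paper's data.
    modules = {"g1": {"M00001"}, "g2": {"M00001", "M00002"}, "g3": {"M00003"}}
    print(association_precision(pairs_within(["g1", "g2", "g3", "g4"]), modules))
```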
Fig. 3. Evolutionary conservation of gene clusters in an example cyanobacterium.
a, Zoomed view of clustered hits of Synechocystis sp. PCC 6803 (genome ID 527) against 1,308 bacterial reference genomes. Query proteins with location indices 500–800 on the genome shown. Top, Cluster heat map of the presence/absence of clustered hits across reference bacteria. Middle, Transcription direction (black, forward; white, reverse). Bottom, Number of clustered hits per protein (blue) and hit pairs in the same gene cluster (pink). b, Example cyanobacteria-specific gene cluster 1. Gene names annotated by eggNOG-mapper are shown at the top. ATCC, American Type Culture Collection.
Fig. 4. Spacedust recovers the vast majority of antiviral defense systems predicted by specialized tools.
a,b, Percentage (a) and number (b) of multi-gene antiviral defense system clusters predicted by PADLOC that are also discovered by Spacedust in their entirety (green circle), partially (blue square) or missed (red triangle) within 1,308 bacterial genomes. 95% of all defense system clusters predicted by PADLOC were recovered, of which 93% are in full length.
Fig. 5. Prediction of 207 manually annotated BGCs from nine genomes.
a–d, For each of the 207 BGCs, we computed the F1 score as the harmonic mean of recall and precision. The recall for a BGC is the fraction of genes in the BGC that have been predicted by the tool, and the precision is the fraction of genes in the predicted region that overlap the annotated BGC. Scatterplots of F1 scores of ClusterFinder versus Spacedust (a), DeepBGC versus Spacedust (b) and GECCO versus Spacedust (c) for the 207 annotated BGCs. d, Cumulative distribution of the F1 scores for the 207 BGCs.
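The per-BGC metrics defined in this caption can be written out directly. The sketch below follows the caption's definitions of recall, precision and F1. The function name `bgc_f1`, the use of gene-ID sets, and the convention of returning zeros when there is no overlap are assumptions made for illustration.

```python
from typing import Iterable, Tuple


def bgc_f1(annotated_genes: Iterable[str],
           predicted_genes: Iterable[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for one annotated BGC and one prediction."""
    annotated = set(annotated_genes)
    predicted = set(predicted_genes)
    overlap = len(annotated & predicted)
    # Recall: fraction of genes in the BGC that were predicted by the tool.
    recall = overlap / len(annotated) if annotated else 0.0
    # Precision: fraction of predicted genes that overlap the annotated BGC.
    precision = overlap / len(predicted) if predicted else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0  # assumed convention when nothing overlaps
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


if __name__ == "__main__":
    # Illustrative gene IDs only; not taken from the paper's data.
    print(bgc_f1(["g1", "g2", "g3", "g4"], ["g3", "g4", "g5", "g6", "g7", "g8"]))
```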
Fig. 6. Additional instances of CRISPR–Cas subtype III-E clusters identified in GTDB.
Visualization of the gene neighborhood of the clusters identified around the gene encoding Cas7-11 (orange arrow). Desulfonema ishimotonii (NZ_BEXT01000001.1), the representative of query III-E systems, is plotted as a reference to show the gene composition and order. Genes within the cluster boundary that could not be matched to Cas-related genes are colored in white.
Extended Data Fig. 1. Precision-recall (PR) of functional association of non-redundant conserved clusters (including the ribosomal genes) for (A) Foldseek+MMseqs search and (B) Foldseek-only search with 3Di sequences predicted by ProstT5, assessed by congruence of KEGG module IDs of Spacedust cluster matches for all gene pairs separated by up to 4 genes (i,i+1),…,(i,i+4).
Extended Data Fig. 2. Evolutionary conservation of gene clusters in a cyanobacterium Synechocystis sp. PCC6803.
Clustered hits of Synechocystis sp. PCC6803 (Genome ID 527) against 1308 bacterial reference genomes using Spacedust Foldseek+MMseqs2 search.
Extended Data Fig. 3. Evolutionary conservation of gene clusters in a cyanobacterium Synechocystis sp. PCC6803 (ProstT5).
Clustered hits of Synechocystis sp. PCC6803 (Genome ID 527) against 1308 bacterial reference genomes using Spacedust Foldseek-only search with 3Di sequences predicted by ProstT5.
Extended Data Fig. 4. Evolutionary conservation of gene clusters in a cyanobacterium Synechocystis sp. PCC6803 (MMseqs2).
Clustered hits of Synechocystis sp. PCC6803 (Genome ID 527) against 1308 bacterial reference genomes using Spacedust MMseqs2 search.
Extended Data Fig. 5. Gene neighborhood of Cyanobacteria-specific cluster 1 (protein IDs 510–515), centered around protein 512.
Extended Data Fig. 6. Gene neighborhood of Cyanobacteria-specific cluster 2 (protein IDs 648–652), centered around protein 649.
Extended Data Fig. 7. Gene neighborhood of Cyanobacteria-specific cluster 3 (protein IDs 655–657), centered around protein 655.
Extended Data Fig. 8. Scatter plots of precision versus recall for the 207 annotated BGCs for (A) ClusterFinder, (B) DeepBGC, (C) GECCO and (D) Spacedust.
Extended Data Fig. 9. Contig view of 9 reference genomes with genomic regions predicted by ClusterFinder (green), DeepBGC (orange), GECCO (yellow) and Spacedust (blue) overlapping with the annotated BGCs (grey).
Extended Data Fig. 10. Example BGC regions (DS999641.1) identified by ClusterFinder (green), DeepBGC (orange), GECCO (yellow) and Spacedust (blue), superimposed upon annotated BGCs (grey) along with antiSMASH (version 8) predictions and functional categories.
