PLoS Comput Biol. 2024 Aug 5;20(8):e1012343.
doi: 10.1371/journal.pcbi.1012343. eCollection 2024 Aug.

Database size positively correlates with the loss of species-level taxonomic resolution for the 16S rRNA and other prokaryotic marker genes

Seth Commichaux et al. PLoS Comput Biol.

Abstract

For decades, the 16S rRNA gene has been used to taxonomically classify prokaryotic species and to taxonomically profile microbial communities. However, the 16S rRNA gene has been criticized for being too conserved to differentiate between distinct species. We argue that the inability to differentiate between species is not a unique feature of the 16S rRNA gene. Rather, we observe the gradual loss of species-level resolution for other nearly universal prokaryotic marker genes as the number of gene sequences increases in reference databases. This trend was strongly correlated with how well represented a taxonomic group was in the database and indicates that, at the gene level, the boundaries between many species might be fuzzy. Through our study, we argue that any approach that relies on a single marker to distinguish bacterial taxa is fraught, even if some markers appear to be discriminative in current databases.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Workflow diagram of the analysis done for the SILVA database and the GTDB.
The SILVA database and the GTDB were downloaded, and sequences with incomplete taxonomic labels or from mitochondria and plastids were removed. To create the simulated databases for each marker gene, we created a collection of random subsets varying in size from 10,000 to 200,000 sequences in increments of 10,000 sequences. Each simulated database was clustered at 95%, 97%, 99%, and 100% identity, requiring that shorter sequences fully align to longer ones.
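The subsampling step in this workflow can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the fixed seed, and the choice to draw each subset independently and without replacement are assumptions made for illustration only.

```python
import random


def database_sizes(start=10_000, stop=200_000, step=10_000):
    """Return the simulated database sizes, from start to stop inclusive."""
    return list(range(start, stop + 1, step))


def sample_simulated_databases(sequence_ids, sizes, seed=0):
    """Draw one random subset of sequence IDs for each requested size.

    Assumption: subsets are drawn independently and without replacement.
    Sizes larger than the pool of available sequences are skipped.
    """
    rng = random.Random(seed)
    return {n: rng.sample(sequence_ids, n) for n in sizes if n <= len(sequence_ids)}
```

Each resulting subset would then be passed to a clustering tool at the chosen identity thresholds; that step is outside the scope of this sketch.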
Fig 2. Clustering analysis for simulated databases created by randomly sampling sequences from the 16S rRNA SILVA database and the 120 marker gene Genome Taxonomy Database (GTDB).
Each simulated database was clustered at 95%, 97%, 99%, and 100% identity requiring that shorter sequences fully align to longer ones. The 16S rRNA gene is denoted by a star in all subplots. A) The relationship between the number of genes in the simulated databases, the number of clusters, the number of multi-species clusters, and the number of sequences in multi-species clusters. For GTDB, each curve is for one of the 120 marker genes. B) The rate at which sequences were recruited to multi-species clusters as the database grows. Each point represents one of the 120 marker genes in the GTDB. C) The percentage of species with sequences in multi-species clusters. D) The relationship between the number of multi-species clusters that a species belongs to and the species richness of its genus (i.e., the total number of species from that genus) in the simulated database. This data was only taken from the final iteration of the simulated databases. The results were aggregated across all 120 marker genes in the GTDB.
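The quantities plotted in panels A and C can be computed from a table of cluster assignments and species labels. The sketch below assumes two hypothetical dictionaries, `cluster_of` (sequence ID to cluster ID) and `species_of` (sequence ID to species label), and treats a cluster as multi-species when its members carry more than one species label. It illustrates the bookkeeping only and is not the authors' implementation.

```python
from collections import defaultdict


def multi_species_summary(cluster_of, species_of):
    """Summarize multi-species clusters for one clustered database.

    cluster_of: dict mapping sequence ID -> cluster ID (hypothetical input)
    species_of: dict mapping sequence ID -> species label (hypothetical input)
    """
    species_in_cluster = defaultdict(set)
    size_of_cluster = defaultdict(int)
    for seq_id, cluster_id in cluster_of.items():
        species_in_cluster[cluster_id].add(species_of[seq_id])
        size_of_cluster[cluster_id] += 1

    # A cluster is multi-species if it contains more than one species label.
    multi = [c for c, sp in species_in_cluster.items() if len(sp) > 1]

    affected_species = set()
    for c in multi:
        affected_species |= species_in_cluster[c]
    all_species = {species_of[s] for s in cluster_of}

    percent = 100.0 * len(affected_species) / len(all_species) if all_species else 0.0
    return {
        "clusters": len(species_in_cluster),
        "multi_species_clusters": len(multi),
        "sequences_in_multi_species_clusters": sum(size_of_cluster[c] for c in multi),
        "percent_species_affected": percent,
    }
```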
Fig 3. Workflow diagram of the analysis done for the Listeria marker gene simulated databases (16S rRNA and 40 marker genes).
First, 5,014 Listeria draft genomes were downloaded from RefSeq, and the 16S rRNA and 40 marker genes were predicted with Barrnap and FetchMG, respectively. Genes shorter than half or longer than twice the mean length for a specific marker gene were removed. To create the simulated databases for each marker gene, we randomly subsampled the sequences into subsets varying in size from 1,000 to 5,000 sequences in increments of 1,000 sequences. We repeated this process 100 times so we could estimate the variability of our results. Each simulated database was clustered at 95%, 97%, 99%, and 100% identity, requiring that shorter sequences fully align to longer ones.
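The length filter described in this caption can be sketched as follows. The function name and the input layout (a dictionary of sequence ID to sequence string for a single marker gene) are assumptions, as is the choice to keep sequences that fall exactly on either bound.

```python
from statistics import mean


def filter_by_length(sequences):
    """Keep sequences between half and twice the mean length.

    sequences: dict mapping sequence ID -> sequence string for ONE marker gene
    (hypothetical input layout). Bounds are treated as inclusive, an assumption.
    """
    if not sequences:
        return {}
    mean_len = mean(len(s) for s in sequences.values())
    return {
        seq_id: s
        for seq_id, s in sequences.items()
        if 0.5 * mean_len <= len(s) <= 2.0 * mean_len
    }
```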
Fig 4. Clustering analysis for the simulated databases created by randomly sampling sequences from the 16S rRNA and the 40 marker genes extracted from 5,014 Listeria genomes.
Each simulated database was clustered at 95%, 97%, 99%, and 100% identity, requiring that shorter sequences fully align to longer ones. The results for each gene are reported by the median over 100 bootstrap experiments. The 16S rRNA gene is denoted by a star in all subplots. A) The relationship between the number of genes in the simulated databases, the number of clusters, the number of multi-species clusters, and the number of sequences in multi-species clusters. Each curve represents one of the 40 marker genes. The starred curve represents the 16S rRNA gene. B) The rate at which sequences were recruited to multi-species clusters as the database grows. Each point represents one of the 40 marker genes. C) The percentage of species with sequences in multi-species clusters.
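Reporting each gene by its median over the repeated experiments amounts to taking a per-size median across replicates. A minimal sketch, assuming each replicate is stored as a hypothetical dictionary mapping database size to a measured value (for example, the number of multi-species clusters):

```python
from statistics import median


def median_by_size(replicates):
    """Return the median value at each database size across replicates.

    replicates: list of dicts mapping database size -> measured value
    (hypothetical layout; one dict per repeated subsampling experiment).
    """
    sizes = sorted({n for rep in replicates for n in rep})
    return {n: median(rep[n] for rep in replicates if n in rep) for n in sizes}
```

The median is less sensitive than the mean to an occasional replicate with an unusually large value.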
