PLoS Comput Biol. 2024 Aug 5;20(8):e1012343.

doi: 10.1371/journal.pcbi.1012343. eCollection 2024 Aug.

Database size positively correlates with the loss of species-level taxonomic resolution for the 16S rRNA and other prokaryotic marker genes

Seth Commichaux¹, Tu Luan²,³, Harihara Subrahmaniam Muralidharan²,³, Mihai Pop²,³

Affiliations

¹ Center for Food Safety and Nutrition, Food and Drug Administration, Laurel, Maryland, United States of America.
² Department of Computer Science, University of Maryland, College Park, Maryland, United States of America.
³ Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, United States of America.

PMID: 39102435
PMCID: PMC11326629
DOI: 10.1371/journal.pcbi.1012343

Abstract

For decades, the 16S rRNA gene has been used to taxonomically classify prokaryotic species and to taxonomically profile microbial communities. However, the 16S rRNA gene has been criticized for being too conserved to differentiate between distinct species. We argue that the inability to differentiate between species is not a unique feature of the 16S rRNA gene. Rather, we observe the gradual loss of species-level resolution for other nearly-universal prokaryotic marker genes as the number of gene sequences increases in reference databases. This trend was strongly correlated with how well represented a taxonomic group was in the database and indicates that, at the gene level, the boundaries between many species might be fuzzy. Through our study, we argue that any approach that relies on a single marker to distinguish bacterial taxa is fraught even if some markers appear to be discriminative in current databases.

Copyright: This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Fig 1. Workflow diagram of the analysis done for the SILVA database and the GTDB.**
SILVA and the GTDB were downloaded, and sequences with incomplete taxonomic labels, or from mitochondria or plastids, were removed. To create the simulated databases for each marker gene, we created a collection of random subsets varying in size from 10,000 to 200,000 sequences in increments of 10,000 sequences. Each simulated database was clustered at 95%, 97%, 99%, and 100% identity, requiring that shorter sequences fully align to longer ones.
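The subsampling step in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, and it assumes each subset is drawn independently and without replacement, which the caption does not specify.

```python
import random


def simulated_database_sizes(start=10_000, stop=200_000, step=10_000):
    """Return the simulated database sizes, inclusive of `stop`."""
    return list(range(start, stop + 1, step))


def random_subsets(sequence_ids, sizes, seed=0):
    """Draw one random subset of sequence IDs for each requested size.

    Assumption: subsets are sampled independently and without replacement.
    Sizes larger than the pool of available sequences are skipped.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    ids = list(sequence_ids)
    return {n: rng.sample(ids, n) for n in sizes if n <= len(ids)}
```

With the default arguments, `simulated_database_sizes()` yields the 20 sizes described in the caption (10,000 through 200,000).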

**Fig 2. Clustering analysis for simulated databases created by randomly sampling sequences from the 16S rRNA SILVA database and the 120 marker gene Genome Taxonomy Database (GTDB).**
Each simulated database was clustered at 95%, 97%, 99%, and 100% identity, requiring that shorter sequences fully align to longer ones. The 16S rRNA gene is denoted by a star in all subplots. A) The relationship between the number of genes in the simulated databases, the number of clusters, the number of multi-species clusters, and the number of sequences in multi-species clusters. For GTDB, each curve is for one of the 120 marker genes. B) The rate at which sequences were recruited to multi-species clusters as the database grows. Each point represents one of the 120 marker genes in the GTDB. C) The percentage of species with sequences in multi-species clusters. D) The relationship between the number of multi-species clusters that a species belongs to and the species richness of its genus (i.e., the total number of species from that genus) in the simulated database. These data were taken only from the final iteration of the simulated databases. The results were aggregated across all 120 marker genes in the GTDB.
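The quantities plotted in panels A and C can be computed from a table of cluster assignments. The sketch below assumes a hypothetical input format, an iterable of `(cluster_id, sequence_id, species)` tuples, and treats a cluster as multi-species when it contains sequences labeled with more than one species; the authors' actual data structures are not described here.

```python
from collections import defaultdict


def multi_species_summary(assignments):
    """Summarize clusters given (cluster_id, sequence_id, species) tuples."""
    species_by_cluster = defaultdict(set)
    size_by_cluster = defaultdict(int)
    all_species = set()
    for cluster_id, _sequence_id, species in assignments:
        species_by_cluster[cluster_id].add(species)
        size_by_cluster[cluster_id] += 1
        all_species.add(species)

    # A multi-species cluster holds sequences from more than one species.
    multi = [c for c, sp in species_by_cluster.items() if len(sp) > 1]
    affected_species = set().union(*(species_by_cluster[c] for c in multi))

    return {
        "clusters": len(species_by_cluster),
        "multi_species_clusters": len(multi),
        "sequences_in_multi_species_clusters": sum(size_by_cluster[c] for c in multi),
        "percent_species_affected": (
            100.0 * len(affected_species) / len(all_species) if all_species else 0.0
        ),
    }
```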

**Fig 3. Workflow diagram of the analysis done for the *Listeria* marker gene simulated databases (16S rRNA and 40 marker genes).**
First, 5,014 *Listeria* draft genomes were downloaded from RefSeq, and the 16S rRNA and 40 marker genes were predicted with Barrnap and fetchMG, respectively. Genes shorter than half or longer than twice the mean length for a specific marker gene were removed. To create the simulated databases for each marker gene, we randomly subsampled the sequences into subsets varying in size from 1,000 to 5,000 sequences in increments of 1,000 sequences. We repeated this process 100 times so we could estimate the variability of our results. Each simulated database was clustered at 95%, 97%, 99%, and 100% identity, requiring that shorter sequences fully align to longer ones.
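The length filter described in this caption can be illustrated as follows. This is a sketch under stated assumptions: the function name and the dictionary input are hypothetical, the mean is computed once over all sequences of a given marker gene before filtering, and sequences exactly at half or twice the mean are kept.

```python
from statistics import mean


def filter_by_length(sequences):
    """Keep sequences whose length lies between half and twice the mean length.

    `sequences` maps a sequence ID to its sequence string for one marker gene.
    """
    if not sequences:
        return {}
    average = mean(len(seq) for seq in sequences.values())
    return {
        seq_id: seq
        for seq_id, seq in sequences.items()
        if average / 2 <= len(seq) <= 2 * average
    }
```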

**Fig 4. Clustering analysis for the simulated databases created by randomly sampling sequences from the 16S rRNA and the 40 marker genes extracted from 5,014 *Listeria* genomes.**
Each simulated database was clustered at 95%, 97%, 99%, and 100% identity, requiring that shorter sequences fully align to longer ones. The results for each gene are reported as the median over 100 bootstrap experiments. The 16S rRNA gene is denoted by a star in all subplots. A) The relationship between the number of genes in the simulated databases, the number of clusters, the number of multi-species clusters, and the number of sequences in multi-species clusters. Each curve represents one of the 40 marker genes. The starred curve represents the 16S rRNA gene. B) The rate at which sequences were recruited to multi-species clusters as the database grows. Each point represents one of the 40 marker genes. C) The percentage of species with sequences in multi-species clusters.
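One way to aggregate replicates and estimate a recruitment rate is sketched below. The median step follows the caption; the rate is computed here as the least-squares slope of sequences in multi-species clusters against database size, which is an assumption made for illustration, since the caption does not state how the rate was derived. Function names and the input layout are hypothetical.

```python
from statistics import median


def median_per_size(replicates):
    """Median count at each database size across replicate experiments.

    `replicates` is a list of dicts mapping database size to a count,
    with every dict sharing the same set of sizes.
    """
    sizes = sorted(replicates[0])
    return {n: median(rep[n] for rep in replicates) for n in sizes}


def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

Passing the sizes and medians from `median_per_size` to `least_squares_slope` gives a single rate per gene, in units of sequences recruited per sequence added to the database.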

Update of

Database size positively correlates with the loss of species-level taxonomic resolution for the 16S rRNA and other prokaryotic marker genes.
Commichaux S, Luan T, Muralidharan HS, Pop M. bioRxiv [Preprint]. 2023 Dec 14:2023.12.13.571439. doi: 10.1101/2023.12.13.571439. Update in: PLoS Comput Biol. 2024 Aug 5;20(8):e1012343. doi: 10.1371/journal.pcbi.1012343. PMID: 38168205.

Grants and funding

R01 AI100947/AI/NIAID NIH HHS/United States
