PLoS Biol. 2021 Nov 9;19(11):e3001421.
doi: 10.1371/journal.pbio.3001421. eCollection 2021 Nov.

Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences

Grace A Blackwell et al. PLoS Biol. 2021.

Abstract

The open sharing of genomic data provides an incredibly rich resource for the study of bacterial evolution and function and even anthropogenic activities such as the widespread use of antimicrobials. However, these data consist of genomes assembled with different tools and levels of quality checking, and of large volumes of completely unprocessed raw sequence data. In both cases, considerable computational effort is required before biological questions can be addressed. Here, we assembled and characterised 661,405 bacterial genomes retrieved from the European Nucleotide Archive (ENA) in November of 2018 using a uniform standardised approach. Of these, 311,006 did not previously have an assembly. We produced a searchable COmpact Bit-sliced Signature (COBS) index, facilitating the easy interrogation of the entire dataset for a specific sequence (e.g., gene, mutation, or plasmid). Additional MinHash and pp-sketch indices support genome-wide comparisons and estimations of genomic distance. Combined, this resource will allow data to be easily subset and searched, phylogenetic relationships between genomes to be quickly elucidated, and hypotheses rapidly generated and tested. We believe that this combination of uniform processing and variety of search/filter functionalities will make this a resource of very wide utility. In terms of diversity within the data, a breakdown of the 639,981 high-quality genomes emphasised the uneven species composition of the ENA/public databases, with just 20 of the total 2,336 species making up 90% of the genomes. The overrepresented species tend to be acute/common human pathogens, aligning with research priorities at different levels from individual interests to funding bodies and national and global public health agencies.
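To make concrete the kind of question a sequence index answers (which genomes contain a given gene, mutation, or plasmid), the following is a minimal Python sketch of a k-mer containment query. It is not the COBS implementation: it stores exact k-mer sets in memory, whereas a real index over hundreds of thousands of genomes needs a compact approximate structure. The function names, the toy genomes, the value k=5, and the 0.8 reporting threshold are all assumptions chosen for illustration.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq (uppercased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(genomes, k):
    """Map each genome name to its k-mer set (an exact, toy stand-in for an index)."""
    return {name: kmers(seq, k) for name, seq in genomes.items()}


def query(index, query_seq, k, threshold=0.8):
    """Return {genome: fraction of query k-mers found} for genomes at or above threshold."""
    q = kmers(query_seq, k)
    if not q:
        # Query shorter than k: nothing to look up.
        return {}
    hits = {}
    for name, genome_kmers in index.items():
        frac = len(q & genome_kmers) / len(q)
        if frac >= threshold:
            hits[name] = frac
    return hits


# Hypothetical toy sequences, not taken from the dataset.
toy_genomes = {
    "genome_A": "ACGTACGTTAGCTAGCTAGGATCC",
    "genome_B": "TTTTGGGGCCCCAAAATTTTGGGG",
}
index = build_index(toy_genomes, k=5)
hits = query(index, "TAGCTAGCTAGG", k=5)
print(hits)
```

Reporting the fraction of matching k-mers, rather than requiring an exact substring match, lets a query tolerate a few mismatches; the threshold controls how strict the search is.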

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Species composition of the 639,981 high-quality assemblies.
(A) Relative proportions of species to the data as a pie chart. Note that 90% of the assemblies are from 20 bacterial species. (B) Fraction of assemblies covered by accumulating bacterial species. (C) Tracking proportions of the top 10 bacterial species for each year. The data underlying this figure may be found in https://doi.org/10.6084/m9.figshare.16437939.
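The cumulative curve in panel (B) can be reproduced from a table of per-species assembly counts. The Python sketch below shows one way to do that, assuming the counts are available as a simple mapping; the function names and the toy counts are hypothetical and are not values from the paper or from the figshare dataset.

```python
def cumulative_fractions(counts):
    """Cumulative fraction of assemblies covered, adding species from most to least common."""
    total = sum(counts.values())
    fractions = []
    running = 0
    for c in sorted(counts.values(), reverse=True):
        running += c
        fractions.append(running / total)
    return fractions


def species_needed(counts, target=0.9):
    """Smallest number of top-ranked species whose assemblies reach the target fraction."""
    if not counts or sum(counts.values()) == 0:
        return 0
    for n, frac in enumerate(cumulative_fractions(counts), start=1):
        if frac >= target:
            return n
    return len(counts)


# Hypothetical toy counts for illustration only.
toy_counts = {"sp1": 50, "sp2": 30, "sp3": 12, "sp4": 5, "sp5": 3}
print(cumulative_fractions(toy_counts))
print(species_needed(toy_counts, target=0.9))
```

Applied to the real per-species counts, the same calculation would be expected to return the number of species reported in the caption for the 90% mark.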
Fig 2. Number of AMR genes in individual genomes of the orders (A) Bacilli and (B) Gammaproteobacteria.
Genera in bold contain species that are among the top 20 represented species in the 661K snapshot. Arrows above indicate genera that contain species designated by WHO as critical (red), high (orange), or medium (yellow) priority pathogens for research and development into new antibiotics [29]. The Actinobacteria order is not shown as it does not contain a member of the WHO priority pathogen list. The data underlying this figure may be found in https://doi.org/10.6084/m9.figshare.16437939. AMR, antimicrobial resistance.
Fig 3. Predicted AMR profiles of species from (A) Bacilli, (B) Gammaproteobacteria, and (C) Actinobacteria.
The panels show the number of predicted antimicrobial classes each isolate is resistant to, based on genetic profile. The red line indicates the threshold for MCR (predicted resistance to 3 classes of antimicrobials or more). Species are classed as MCR (red in figure) if at least 50% of the assemblies are MCR. Species included have at least 10 assemblies. The data underlying this figure may be found in https://doi.org/10.6084/m9.figshare.16437939. AMR, antimicrobial resistance; MCR, multi-class resistant.
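The classification rule stated in this caption is simple enough to write down directly. The following Python sketch applies it under the assumption that each assembly has already been reduced to a count of predicted resistance classes; the function name, parameter names, and toy data are hypothetical, and only the three thresholds (3 classes, 50% of assemblies, at least 10 assemblies) come from the caption.

```python
def classify_species(class_counts_by_species,
                     mcr_threshold=3,
                     min_assemblies=10,
                     species_fraction=0.5):
    """Return {species: True if classed as MCR} for species with enough assemblies.

    class_counts_by_species maps a species name to a list holding, for each
    assembly, the number of antimicrobial classes it is predicted resistant to.
    """
    result = {}
    for species, counts in class_counts_by_species.items():
        if len(counts) < min_assemblies:
            # Species with too few assemblies are left out entirely.
            continue
        n_mcr = sum(1 for c in counts if c >= mcr_threshold)
        result[species] = (n_mcr / len(counts)) >= species_fraction
    return result


# Hypothetical toy data for illustration only.
toy = {
    "species_X": [0, 1, 3, 4, 5, 3, 2, 6, 3, 0],  # 6 of 10 assemblies are MCR
    "species_Y": [0, 0, 0, 0, 0, 0, 0, 0, 3, 4],  # 2 of 10 assemblies are MCR
    "species_Z": [5, 5, 5],                       # fewer than 10 assemblies
}
labels = classify_species(toy)
print(labels)
```

Here species_X is classed as MCR, species_Y is not, and species_Z is excluded because it falls below the minimum number of assemblies.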
