ISME Commun. 2021 May 20;1(1):16.
doi: 10.1038/s43705-021-00017-z.

Automated analysis of genomic sequences facilitates high-throughput and comprehensive description of bacteria

Thomas C A Hitch et al. ISME Commun. 2021.

Abstract

The study of microbial communities is hampered by the large fraction of still unknown bacteria. Many of these species have been isolated, yet lack a validly published name or description. The validation of names for novel bacteria requires that the uniqueness of those taxa is demonstrated and their properties are described. The accepted format for this is the protologue, which can be time-consuming to create. Hence, many research fields in microbiology and biotechnology will greatly benefit from new approaches that reduce the workload and harmonise the generation of protologues. We have developed Protologger, a bioinformatic tool that automatically generates all the necessary readouts for writing a detailed protologue. By producing multiple taxonomic outputs, functional features and ecological analyses from the 16S rRNA gene and genome sequences of a single species, it substantially reduces the time needed to gather the information for describing novel taxa. The usefulness of Protologger was demonstrated by using three published isolate collections to describe 34 novel taxa, encompassing 17 novel species and 17 novel genera, including the automatic generation of ecologically and functionally relevant names. We also highlight the need to utilise multiple taxonomic delineation methods: although inconsistencies between the methods occur, a combined approach provides robust placement. Protologger is open source; all scripts and datasets are available, along with a webserver at www.protologger.de.

Conflict of interest statement

T.C. has ongoing scientific collaborations with Cytena GmbH and HiPP GmbH and is a member of the scientific advisory board of Savanna Ingredients GmbH.

Figures

Fig. 1. Simplified overview of Protologger.
The key steps within Protologger are highlighted with the tools utilised for each step indicated (in brackets), along with the quality assurance steps. Sections are coloured according to the information they provide with taxonomic placement (in yellow), ecology (in blue), and functionality (in red). The ‘validity check’ stage in taxonomic assignment involves the removal of taxa without validly published names from genomic comparison.
Fig. 2. Overview of the ecological databases used within Protologger.
a Representation of the diverse environments from which the MAGs used in the ecological analysis originate. For each environment, the number of MAGs is stated, along with a pie chart indicating the three most prevalent bacterial phyla (see colour code in the figure), as determined by GTDB-Tk cross-referenced with LPSN. MAGs termed as ‘generic’ due to a lack of metadata are not included (n = 3397). b Phylum-level taxonomic diversity within the IMNGS amplicon studies utilised within the 16S rRNA gene amplicon-based habitat preference and distribution analysis. These datasets span 63 phyla represented by over 37,314,233 OTUs. The names of phyla lacking a child taxon with a validly published name are in red, as determined via the LPSN database.
Fig. 3. Comparison of taxonomic delineation methods.
a Pairwise comparisons between the dDDH and ANI values obtained for 70 isolates and their closest relatives, as identified by Protologger (n = 1599). Red lines highlight published boundaries for species delineation: dDDH, <70%; ANI, <95%. b Pairwise comparisons to test the consistency of genus-level delineation parameters (n = 30,247). For each comparison, four groups were formed based on how each method assigned the paired genomes: congruent results, same genus (Cong. Same); congruent results, different genera (Cong. Dif.); and those uniquely identified as belonging to different genera according to the method specified (Uniq. Dif.). See colour code in the figure.
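The species boundaries quoted in panel a can be applied to a single genome pair as in the following minimal sketch. The function name and group labels are hypothetical, and the four-way grouping from panel b (used there for genus-level parameters) is borrowed here for the species cutoffs purely as an illustration. It is not Protologger's own code.

```python
# Species boundaries quoted in the caption: values below these cutoffs
# indicate that two genomes belong to different species.
DDDH_SPECIES_CUTOFF = 70.0  # percent dDDH
ANI_SPECIES_CUTOFF = 95.0   # percent ANI


def classify_pair(dddh: float, ani: float) -> str:
    """Compare the dDDH and ANI verdicts for one genome pair.

    The returned labels are hypothetical names that mirror the
    'congruent' and 'uniquely different' groups of panel b.
    """
    same_by_dddh = dddh >= DDDH_SPECIES_CUTOFF
    same_by_ani = ani >= ANI_SPECIES_CUTOFF
    if same_by_dddh and same_by_ani:
        return "congruent_same"
    if not same_by_dddh and not same_by_ani:
        return "congruent_different"
    if not same_by_dddh:
        # Only dDDH separates the pair.
        return "unique_different_dddh"
    # Only ANI separates the pair.
    return "unique_different_ani"
```

A pair that falls into one of the two "unique" groups is the kind of inconsistency that motivates combining several delineation methods rather than relying on one.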
Fig. 4. Uncovering and describing taxonomic novelty using Protologger.
All non-redundant species-level isolates from three large collections were processed: the human bacterial collection (HBC), the Broad Institute-OpenBiome Microbiome Library (BIO-ML) and the Hungate1000 collection. a Each collection contained novel taxa, representing either undescribed species or genera. b Phylum-level diversity of the undescribed isolates. c Phylogenomic tree of the novel HBC isolates described and named. For some species, multiple strains were identified; therefore, the type strain DSM number is in bold (see protologues). Isolates matching HMP ‘most wanted’ species are highlighted with green balls at the branch tips, with the size representing priority. The external rings represent isolate-specific information as follows: (i) the inner ring highlights the novelty, either species or genus; (ii) the centre ring indicates to which family the isolates are assigned; (iii) the outer ring shows the prevalence of each isolate across 1000 human gut amplicon samples (the ecosystem of origin of the isolates), with values ranging from 1.0–69.6%.
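Prevalence in the outer ring is the percentage of samples in which an isolate is detected. A minimal sketch of that calculation follows. The function name, the detection cutoff and the example abundances are assumptions made for illustration; the caption does not state how detection was defined.

```python
def prevalence_percent(relative_abundances, min_abundance):
    """Percentage of samples in which a taxon is detected.

    relative_abundances: one value per sample (percent of reads).
    min_abundance: hypothetical detection cutoff; a sample counts as
    positive when its value is at or above this cutoff.
    """
    if not relative_abundances:
        raise ValueError("at least one sample is required")
    positive = sum(1 for value in relative_abundances if value >= min_abundance)
    return 100.0 * positive / len(relative_abundances)


# Invented abundances for four samples, with an assumed cutoff of 0.25%.
example_prevalence = prevalence_percent([0.0, 0.5, 1.0, 0.1], min_abundance=0.25)
```

With 1000 samples, a prevalence of 1.0% corresponds to detection in 10 samples.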
Fig. 5. Quality and novelty within a MAG dataset from the mouse intestine.
a Novelty of the 484 iMGMC MAGs according to their 16S rRNA gene sequence similarity to their closest relative. b Comparison of the number of error warnings per input generated by Protologger for the MAG and isolate collections. White dots show the median number of errors, the dark grey bar highlights the interquartile range and the black line indicates the lower/upper adjacent values. c Phylogenomic tree of all 484 MAGs, with rings on the outside highlighting in black the occurrence of Protologger warnings: chimeric 16S rRNA gene sequences, incomplete 16S rRNA gene sequences, incomplete genomes, contaminated genomes. The MAGs with no warning, hence deemed of high quality, are indicated by green bars.
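The summary statistics shown in panel b (median and interquartile range of warnings per input) and the "no warning" set from panel c can be derived from a per-input list of warnings, as in this minimal sketch. The input structure, function name and example records are hypothetical; Protologger's actual output format is not described here.

```python
from statistics import median, quantiles


def summarise_warnings(warnings_by_input):
    """Summarise warning counts per input (needs at least two inputs).

    warnings_by_input: hypothetical mapping of input name to the list of
    warning labels raised for that input.
    """
    counts = [len(warnings) for warnings in warnings_by_input.values()]
    q1, _, q3 = quantiles(counts, n=4)  # quartiles of the warning counts
    no_warning = sorted(
        name for name, warnings in warnings_by_input.items() if not warnings
    )
    return {"median": median(counts), "iqr": (q1, q3), "no_warning": no_warning}


# Invented example records using the warning types named in the caption.
example = {
    "mag_a": [],
    "mag_b": ["incomplete genome"],
    "mag_c": ["chimeric 16S rRNA gene sequence", "contaminated genome"],
    "mag_d": [],
}
summary = summarise_warnings(example)
```

Inputs listed under `no_warning` correspond to those the caption treats as high quality.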
