Int J Syst Evol Microbiol. 2023 Feb;73(1):005707.
doi: 10.1099/ijsem.0.005707.

Collection and curation of prokaryotic genome assemblies from type strains at NCBI

Sivakumar Kannan et al. Int J Syst Evol Microbiol. 2023 Feb.

Abstract

The public sequence databases are entrusted with the dual responsibility of providing an accessible archive to all submitters and supporting data reliability and re-use for all users. Genomes from type materials can act as an unambiguous reference for a taxonomic name and play an important role in comparative genomics, especially for taxon verification or reclassification. The National Center for Biotechnology Information (NCBI) collects and curates information on prokaryotic type strains and genomes from type strains. The average nucleotide identity (ANI)-based quality control processes introduced at NCBI to verify the genomes from type strains and improve related sequence records are detailed here. Using the curated genomes from type strains as reference, the taxonomy of over 1.1 million GenBank genomes was verified, and over 7000 new submissions (before acceptance to GenBank) and over 1800 existing genomes in GenBank were reclassified.

Keywords: ANI; GenBank; genome; taxonomy; type material; type strain.

Conflict of interest statement

The authors declare that there are no conflicts of interest.

Figures

Fig. 1.
Assemblies from type and/or co-identical strains of the same species were mostly similar, with few outliers. Genomic coherence (similarity) between a pair of type assemblies from the same species was measured using average nucleotide identity (ANI) and symmetric overlap (matched region length over the total length of the pair of assemblies). Assemblies were grouped by assembly level (see supplementary information for descriptions of assembly levels) to check whether differences in assembly level could explain any lack of coherence among type assemblies from the same species. Table 8 lists the extreme outliers.
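The symmetric overlap described in this caption can be illustrated with a minimal sketch. The caption only says "matched region length over the total length among a pair of assemblies"; the exact formula used at NCBI is not given here, so the version below (matched bases on both assemblies divided by the combined length of both) is an assumption, and the function name, parameter names and example numbers are hypothetical.

```python
def symmetric_overlap(matched_a: int, matched_b: int, len_a: int, len_b: int) -> float:
    """Fraction of the combined length of two assemblies covered by matched regions.

    Assumed formula (not confirmed by the source):
        (matched_a + matched_b) / (len_a + len_b)
    where matched_x is the number of bases of assembly x that fall in
    regions aligned to the other assembly.
    """
    total = len_a + len_b
    if total <= 0:
        raise ValueError("combined assembly length must be positive")
    if matched_a > len_a or matched_b > len_b:
        raise ValueError("matched length cannot exceed assembly length")
    return (matched_a + matched_b) / total


# Hypothetical pair of 5 Mb assemblies with 4.5 Mb and 4.4 Mb matched.
overlap = symmetric_overlap(4_500_000, 4_400_000, 5_000_000, 5_000_000)
print(f"symmetric overlap: {overlap:.2f}")
```

Because both assemblies contribute to the numerator and the denominator, swapping the two inputs gives the same value, which is what makes the measure symmetric under this assumed definition.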
Fig. 2.
Examples of assemblies from species with high intraspecies genomic diversity. Intraspecies ANI vs symmetric overlap (matched region length over the total length of a pair of assemblies) of assemblies against their corresponding type assemblies from Clostridium botulinum (C. botulinum), Campylobacter lari (C. lari), Enterococcus faecium (E. faecium) and Listeria monocytogenes (L. monocytogenes). The vertical dashed line indicates the default ANI threshold, 96 %. Many assemblies form clusters that match their corresponding type assemblies below the expected ANI threshold and/or coverage.
Fig. 3.
Adding representative assemblies for species with broad genomic diversity improves taxon identification. Intraspecies ANI of all assemblies from Listeria monocytogenes (L. monocytogenes) and Vibrio vulnificus (V. vulnificus) against only their corresponding type assemblies (‘excluding clade ref’) and after including additional representative assemblies (‘including clade ref’). Red circles indicate the best-match ANI value of assemblies against only their type assemblies, and blue circles indicate the new best-match ANI value after including the additional representative assemblies. 8 % of L. monocytogenes and 99 % of V. vulnificus assemblies that previously did not match their type assemblies matched the newly added representative assemblies above the expected ANI threshold.
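The effect described in this caption, where a query assembly that falls below the threshold against the type assembly alone can pass once a clade reference is added, can be sketched as a best-match lookup. This is an illustration only, not NCBI's pipeline: the function names, reference labels and ANI values are hypothetical, and treating a value equal to the threshold as a match is an assumption. Only the 96 % default threshold comes from the captions.

```python
DEFAULT_ANI_THRESHOLD = 96.0  # percent; default threshold quoted in the figure captions


def best_match(ani_by_reference: dict[str, float]) -> tuple[str, float]:
    """Return the (reference, ANI) pair with the highest ANI value."""
    if not ani_by_reference:
        raise ValueError("at least one reference ANI value is required")
    return max(ani_by_reference.items(), key=lambda item: item[1])


def matches_species(ani_by_reference: dict[str, float],
                    threshold: float = DEFAULT_ANI_THRESHOLD) -> bool:
    """True if the best-match ANI reaches the threshold (>= is an assumption)."""
    _, ani = best_match(ani_by_reference)
    return ani >= threshold


# Hypothetical query assembly: below threshold against the type assembly alone ...
type_only = {"type_assembly": 94.2}
# ... but above it once a hypothetical clade reference is included.
with_clade_ref = {"type_assembly": 94.2, "clade_ref_1": 98.7}

print(matches_species(type_only))       # False
print(matches_species(with_clade_ref))  # True
print(best_match(with_clade_ref))       # ('clade_ref_1', 98.7)
```

Taking the maximum over all references means that adding a reference can only raise or preserve the best-match ANI, never lower it.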
Fig. 4.
ANI-based verification of heterotypic synonymization. When two independently described taxa are determined to be the same species, they are merged, and the taxon that was described later becomes the heterotypic synonym of the taxon that was described earlier. Lower ANI values of assemblies from heterotypic synonyms (referred to as ‘syntype assemblies’ or ‘syntype’) against the assemblies from type strains (referred to as ‘type assemblies’ or ‘type’) from the same species (red circles) indicate potentially problematic synonymizations. There were at least 27 cases where the ANI values of the assemblies from heterotypic synonyms were lower than the ANI threshold of the corresponding species. The dotted vertical line indicates the default ANI threshold of 96 %.
