Nat Commun. 2018 Nov 30;9(1):5114.
doi: 10.1038/s41467-018-07641-9.

High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries

Chirag Jain et al. Nat Commun. 2018.

Abstract

A fundamental question in microbiology is whether there is a continuum of genetic diversity among genomes, or whether clear species boundaries prevail instead. Whole-genome similarity metrics such as Average Nucleotide Identity (ANI) help address this question by facilitating high-resolution taxonomic analysis of thousands of genomes from diverse phylogenetic lineages. To scale to available genomes and beyond, we present FastANI, a new method to estimate ANI using alignment-free approximate sequence mapping. FastANI is accurate for both finished and draft genomes, and is up to three orders of magnitude faster than alignment-based approaches. We leverage FastANI to compute pairwise ANI values among all prokaryotic genomes available in the NCBI database. Our results reveal clear genetic discontinuity, with 99.8% of the total 8 billion genome pairs analyzed conforming to >95% intra-species and <83% inter-species ANI values. This discontinuity is manifested with or without the most frequently sequenced species, and is robust to historic additions in the genome databases.
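The two thresholds quoted in the abstract can be read as a three-way rule on a single ANI value. The following is a minimal sketch of that reading, assuming ANI is supplied as a percentage; the function name `classify_ani_pair` and the label strings are illustrative choices and are not part of FastANI.

```python
def classify_ani_pair(ani_percent, intra=95.0, inter=83.0):
    """Label one genome pair by its ANI value (in percent).

    The defaults mirror the >95% intra-species and <83% inter-species
    bands quoted in the abstract; the labels are hypothetical names.
    """
    if not 0.0 <= ani_percent <= 100.0:
        raise ValueError("ANI must be a percentage in [0, 100]")
    if ani_percent > intra:
        return "intra-species"
    if ani_percent < inter:
        return "inter-species"
    # Values from 83% to 95% fall in the sparsely populated middle band.
    return "intermediate"
```

Under this rule, a pair at 98.7% ANI would be labeled intra-species, a pair at 79% inter-species, and a pair at 90% would land in the intermediate band that the study reports as nearly empty.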

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Correlation of FastANI and Mash-based ANI output with ANIb values for datasets D1–D5. Because FastANI assumes a probabilistic identity cutoff that is set to 80% by default, it reports 76, 570, 4271, 464, and 130 genome matches for the individual queries in datasets D1–D5, respectively. To enable a direct quality comparison against FastANI, Mash is executed only for those pairs that are reported by FastANI. Note that each dataset spans a different nucleotide identity range (x-axes). The gray line is the y = x line, plotted for reference. Pearson correlation coefficients corresponding to these plots are listed separately in Table 2. The last plot shows the error of these methods with respect to ANIb across all five datasets.
Fig. 2
Scaling up FastANI’s performance using multi-core parallel execution. We executed parallel FastANI processes on 40 physical cores, where each process was assigned an equally sized part of the reference D1–D5 databases. The left and right plots evaluate FastANI’s compute and indexing phases, respectively. FastANI achieves reasonable speedups on all datasets except in the compute phase for D1 and D5, because their single-core runtime is already too short to benefit (Table 3).
Fig. 3
Genetic discontinuity observed using 90K genomes. a Histogram showing the distribution of ANI values among the 90K genomes. Only ANI values in the 76–100% range are shown. Out of a total of 8.01 billion pairwise genome comparisons, FastANI reported only 17M ANI values (0.21%) between 83 and 95%, indicating a genetic discontinuity. Multiple colors show how genomes from different genera contribute to this distribution. A few peaks in the histogram arise from genera that have been extensively sequenced and dominate the database. b Density curves of ANI values in the 76–100% range. Each curve corresponds to the database at a particular time period. The discontinuity is observed consistently in all four curves. c Distribution of ANI values with each comparison labeled by the nomenclature of the genomes being compared. All comparisons between Escherichia coli and Shigella spp. have been labeled separately. The 95% ANI threshold on the x-axis serves as a valid classifier for separating same-species from different-species comparisons.
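As a quick arithmetic check, the percentage in panel a follows directly from the two counts given in the caption. The sketch below uses those counts as stated (both are rounded in the caption, so the result is approximate); the variable names are illustrative.

```python
# Counts as quoted in the caption (rounded values).
total_pairs = 8.01e9          # all pairwise genome comparisons
intermediate_pairs = 17e6     # ANI values reported between 83% and 95%

# Share of comparisons that fall in the 83-95% band.
intermediate_fraction = intermediate_pairs / total_pairs

# Everything else conforms to the >95% or <83% bands.
conforming_fraction = 1.0 - intermediate_fraction

print(f"{intermediate_fraction:.2%}")  # 0.21%
print(f"{conforming_fraction:.1%}")    # 99.8%
```

The second figure matches the 99.8% of genome pairs that the abstract reports as conforming to the intra-species and inter-species bands.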
Fig. 4
FastANI algorithm explained using synthetic and real examples. a Illustration of FastANI’s workflow for computing ANI between a query genome and a reference genome. Five mappings are obtained from three query fragments using Mashmap. M_forward saves the maximum-identity mapping for each query fragment. In this example, M_forward = {m2, m4, m5}. From this set, M_reciprocal picks m4 and m5 as the maximum-identity mappings for each reference bin. The identities of the orthologous mappings thus found in M_reciprocal are finally averaged to compute ANI. b Using genoPlotR, FastANI supports visualization of the orthologous mappings M_reciprocal that are used to estimate the ANI value. In this figure, ANI is computed between a Bartonella quintana strain (NC_018533.1) as query and a Bartonella henselae strain (NC_005956.1) as reference. Red line segments denote the orthologous mappings computed by FastANI for ANI estimation.
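The selection steps in panel a can be sketched in a few lines of Python. This is a toy illustration of the forward and reciprocal filtering described in the caption, not FastANI's actual implementation: the `Mapping` tuple, the helper names, and every identity value and bin assignment below are assumptions chosen so that the sets come out as in the caption (M_forward = {m2, m4, m5}, M_reciprocal = {m4, m5}).

```python
from collections import namedtuple

# Hypothetical record for one fragment-to-reference mapping.
Mapping = namedtuple("Mapping", "name fragment ref_bin identity")


def best_by(mappings, key):
    """Keep the highest-identity mapping for each distinct key value."""
    best = {}
    for m in mappings:
        k = key(m)
        if k not in best or m.identity > best[k].identity:
            best[k] = m
    return list(best.values())


def estimate_ani(mappings):
    """Return (ani, m_forward, m_reciprocal) for a list of mappings."""
    # Forward step: best mapping per query fragment.
    m_forward = best_by(mappings, key=lambda m: m.fragment)
    # Reciprocal step: best of those per reference bin.
    m_reciprocal = best_by(m_forward, key=lambda m: m.ref_bin)
    if not m_reciprocal:
        return None, m_forward, m_reciprocal
    # ANI estimate: mean identity over the retained mappings.
    ani = sum(m.identity for m in m_reciprocal) / len(m_reciprocal)
    return ani, m_forward, m_reciprocal


# Invented toy data: five mappings from three query fragments.
toy_mappings = [
    Mapping("m1", "f1", 0, 80.0),
    Mapping("m2", "f1", 1, 85.0),
    Mapping("m3", "f2", 2, 82.0),
    Mapping("m4", "f2", 1, 90.0),
    Mapping("m5", "f3", 3, 88.0),
]

ani, m_forward, m_reciprocal = estimate_ani(toy_mappings)
print(sorted(m.name for m in m_forward))     # ['m2', 'm4', 'm5']
print(sorted(m.name for m in m_reciprocal))  # ['m4', 'm5']
print(ani)                                   # 89.0
```

In this toy data, m2 and m4 land in the same reference bin, so the reciprocal step keeps only the higher-identity m4. Any further filtering a real run may apply is deliberately left out of the sketch.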
