BMC Bioinformatics. 2013 Feb 21:14:60.
doi: 10.1186/1471-2105-14-60.

Genome sequence-based species delimitation with confidence intervals and improved distance functions

Jan P Meier-Kolthoff et al. BMC Bioinformatics.

Abstract

Background: For the last 25 years, species delimitation in prokaryotes (Archaea and Bacteria) has been based to a large extent on DNA-DNA hybridization (DDH), a tedious lab procedure designed in the early 1970s that served its purpose astonishingly well in the absence of deciphered genome sequences. With the rapid progress in genome sequencing, the time has come to directly use the now available and easy-to-generate genome sequences for the delimitation of species. GBDP (Genome Blast Distance Phylogeny) infers genome-to-genome distances between pairs of entirely or partially sequenced genomes, a digital, highly reliable estimator for the relatedness of genomes. Its application as an in-silico replacement for DDH was recently introduced. The main challenge in implementing such an application is to produce digital DDH values that mimic the wet-lab DDH values as closely as possible, to ensure consistency in the prokaryotic species concept.

Results: Correlation and regression analyses were used to determine the best-performing methods and the most influential parameters. GBDP was further enriched with a set of new features such as confidence intervals for intergenomic distances obtained via resampling or via the statistical models for DDH prediction and an additional family of distance functions. As in previous analyses, GBDP obtained the highest agreement with wet-lab DDH among all tested methods, but improved models led to a further increase in the accuracy of DDH prediction. Confidence intervals yielded stable results when inferred from the statistical models, whereas those obtained via resampling showed marked differences between the underlying distance functions.

Conclusions: Despite the high accuracy of GBDP-based DDH prediction, inferences from limited empirical data are always associated with a certain degree of uncertainty. It is thus crucial to enrich in-silico DDH replacements with confidence-interval estimation, enabling the user to statistically evaluate the outcomes. Such methodological advancements, easily accessible through the web service at http://ggdc.dsmz.de, are crucial steps towards a consistent and truly genome sequence-based classification of microorganisms.

Figures

Figure 1
An example of a hypothetical HSP layout between two genomes A and B as produced during the GBDP alignment phase. Subsequences that are part of an HSP in either A or B are labeled with small letters a-g. A special case is represented by segment “c” where both HSP 2 and HSP 3 are overlapping. GBDP’s algorithms are programmed to handle these distinctly, i.e., (i) by simply completely omitting the smaller HSP 3 (“greedy” algorithm), (ii) by omitting only segment “c”, i.e. trimming HSP 3 (“greedy-with-trimming” algorithm), or (iii) by merging information from both HSPs regarding the overlapping segment (“coverage” algorithm). (Figure redrawn from [15]).
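The three ways of handling overlapping HSPs named in this caption can be illustrated with a small sketch. It is a simplification, not GBDP's implementation: HSPs are reduced to half-open intervals on a single genome, "smaller" is taken to mean shorter, and the "coverage" variant is reduced to counting positions covered by at least one HSP. All function names and the example coordinates are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one genome (assumption)


def length(iv: Interval) -> int:
    return iv[1] - iv[0]


def overlaps(a: Interval, b: Interval) -> bool:
    return a[0] < b[1] and b[0] < a[1]


def greedy(hsps: List[Interval]) -> List[Interval]:
    """Keep longer HSPs first; drop any HSP that overlaps a kept one."""
    kept: List[Interval] = []
    for h in sorted(hsps, key=length, reverse=True):
        if not any(overlaps(h, k) for k in kept):
            kept.append(h)
    return sorted(kept)


def subtract(iv: Interval, kept: List[Interval]) -> List[Interval]:
    """Return the parts of iv that are not covered by any kept interval."""
    pieces = [iv]
    for k in kept:
        remaining: List[Interval] = []
        for s, e in pieces:
            if not overlaps((s, e), k):
                remaining.append((s, e))
                continue
            if s < k[0]:
                remaining.append((s, k[0]))
            if k[1] < e:
                remaining.append((k[1], e))
        pieces = remaining
    return pieces


def greedy_with_trimming(hsps: List[Interval]) -> List[Interval]:
    """Keep longer HSPs first; trim the overlapping part off shorter ones."""
    kept: List[Interval] = []
    for h in sorted(hsps, key=length, reverse=True):
        kept.extend(subtract(h, kept))
    return sorted(kept)


def coverage(hsps: List[Interval]) -> int:
    """Count positions covered by at least one HSP (union of intervals)."""
    covered = 0
    current = None
    for s, e in sorted(hsps):
        if current is None or s > current[1]:
            if current is not None:
                covered += current[1] - current[0]
            current = (s, e)
        else:
            current = (current[0], max(current[1], e))
    if current is not None:
        covered += current[1] - current[0]
    return covered


# Hypothetical layout: the second and third HSP overlap on [280, 300).
example = [(0, 100), (150, 300), (280, 340)]
```

With this layout, `greedy(example)` drops the third HSP entirely, `greedy_with_trimming(example)` keeps its non-overlapping remainder `(300, 340)`, and `coverage(example)` counts the overlapping segment only once.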
Figure 2
Results of the correlation analyses between GBDP-derived distances and DDH as opposed to the correlations between ANI and DDH. A: The performance of both GBDP and ANI regarding their correlation with wet-lab DDH is shown. The boxplots visualize the correlation results for the data sets DS1-4, created for conducting fair comparisons between GBDP, the original ANI implementation [6] and JSpecies [7] (green circles: Kendall’s τ; orange triangles: Pearson’s ρ). For the purpose of an easier visualization, the scale has been bound by 0 and -1, thus omitting a few outliers greater than 0, and the sign of correlation values involving similarities was inverted. The correlation coefficients between ANI and DDH are highlighted by horizontal lines, either dotted (DS3, ANI; DS4, ANIm), dot-dashed (DS4, ANIb) or long-dashed (DS4, Tetra). B: GBDP correlations (DS1) dependent on the alignment tools used: BLAT (BT), BLAST+ (BP), NCBI-BLAST (NB), WU-BLAST (WU), MUMmer (MU) and BLASTZ (BZ). The dotted lines represent the globally best correlation (i.e., the most negative one), and the boxplots are sorted increasingly by their most negative Kendall coefficient, i.e., the best setting can be found at the leftmost position. The same applies to C and D. C: Results for DS1 dependent on the algorithms “coverage” (COV), “greedy” (GR) and “greedy-with-trimming” (TR). D: Correlations based on DS1 dependent on distance formulae d0 - d9. Because Kendall’s τ depends only on ranks and the logarithm is monotonic, the distance formulae d0, d1, d4, d6 and d7 yielded the same Kendall correlations as their logarithmized variants d2, d3, d5, d8 and d9.
Figure 3
Distributions of the median coefficients of variation of intergenomic distances obtained by resampling GBDP. The depicted distributions were determined by grouping the median coefficient of variation (CV) for each setting by either algorithms (left; “greedy”, gr; “greedy-with-trimming”, tr; “coverage”, cov) or formulae (right).
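The quantity summarized in this figure can be sketched as follows: for each genome pair, the coefficient of variation (standard deviation divided by mean) of its resampled distances, and then the median over all pairs. This is a minimal illustration; the use of the sample standard deviation and the function names are assumptions, not details taken from the paper.

```python
import statistics
from typing import Iterable, Sequence


def coefficient_of_variation(replicates: Sequence[float]) -> float:
    """CV of one genome pair's resampled distances (sample sd / mean)."""
    mean = statistics.fmean(replicates)
    if mean == 0:
        raise ValueError("CV is undefined for a mean of zero")
    # Sample standard deviation (n - 1 denominator) is an assumption here.
    return statistics.stdev(replicates) / mean


def median_cv(replicates_per_pair: Iterable[Sequence[float]]) -> float:
    """Median CV over all genome pairs for one GBDP setting."""
    return statistics.median(
        coefficient_of_variation(r) for r in replicates_per_pair
    )
```

A lower median CV indicates that resampling changes the distances less, that is, the setting yields more stable distance estimates.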
Figure 4
Juxtaposition of confidence-interval widths for both model-based DDH predictions and those induced by bootstrap replicates. Distances were calculated under the selected well-performing GBDP method (see main text) using either the “Coverage” algorithm (A and C) or “Greedy-with-Trimming” (B and D). For each distance value, the respective DDH predictions were made with a simple linear regression model (x-axis) and the widths of their 95% CIs determined accordingly (y-axis).
Figure 5
GLM with a binary response variable. The curve depicts the predictions from the model for the selected well-performing GBDP settings (see main text). The y-axis indicates the GBDP-derived probability that a DDH value is above 70%, indicating that two genomes represent organisms of the same species. The orange vertical line marks the distance threshold for species delineation as provided by the GLM, i.e., denoting a probability of 0.5. The blue vertical line marks an alternative error ratio-based distance threshold as presented in our previous article [8].
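How a distance threshold follows from a probability of 0.5 can be sketched as below. The sketch assumes a logit link, which the caption does not state, and the coefficients B0 and B1 are hypothetical placeholders rather than fitted values from the paper. Under that assumption the predicted probability is 1 / (1 + exp(-(b0 + b1 * d))), so it equals 0.5 exactly where b0 + b1 * d = 0.

```python
import math


def prob_ddh_above_70(distance: float, b0: float, b1: float) -> float:
    """Predicted probability that DDH exceeds 70% (logit link assumed)."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * distance)))


def distance_threshold(b0: float, b1: float, p: float = 0.5) -> float:
    """Distance at which the predicted probability equals p."""
    if b1 == 0:
        raise ValueError("slope must be non-zero")
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return (math.log(p / (1.0 - p)) - b0) / b1


# Hypothetical coefficients, chosen only for illustration.
B0, B1 = 3.5, -50.0
threshold = distance_threshold(B0, B1)
```

With a negative slope, distances below `threshold` give a probability above 0.5 (same species more likely than not), and distances above it give a probability below 0.5.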
Figure 6
Comparison of generalized linear models and data transformations for DDH prediction. All model fits were based on distances calculated with the selected well-performing GBDP method (see main text). The models were either inferred from (i) the complete data set DS1 (red and blue curves, red circles and blue triangles) or (ii) the reduced data set [8] DS2 (blue curve and blue triangles). The green vertical line indicates the 50% probability threshold as calculated by the GLM for binary response data (see Figure 5). Left: GLMs (generalized linear models with quasi-binomial error family) based on DDH proportion data as response and untransformed distance values as predictor variable. Right: GLMs (generalized linear models with quasi-binomial error family) based on DDH as response and logarithmized distance values as predictor variable.

References

    1. Wayne LG, Brenner DJ, Colwell RR, Grimont PaD, Kandler O, Krichevsky MI, Moore LH, Moore WEC, Murray RGE, Stackebrandt E, Starr MP, Truper HG. Report of the Ad Hoc committee on reconciliation of approaches to bacterial systematics. Int J Syst Bacteriol. 1987;37(4):463–464. doi: 10.1099/00207713-37-4-463.
    2. Stackebrandt E, Goebel BM. Taxonomic note: a place for DNA-DNA reassociation and 16S rRNA sequence analysis in the present species definition in bacteriology. Int J Syst Bacteriol. 1994;44(4):846–849. doi: 10.1099/00207713-44-4-846.
    3. Schleifer K. Classification of Bacteria and Archaea: past, present and future. Syst Appl Microbiol. 2009;32(8):533–542. doi: 10.1016/j.syapm.2009.09.002.
    4. Klenk HP, Göker M. En route to a genome-based classification of Archaea and Bacteria? Syst Appl Microbiol. 2010;33(4):175–182. doi: 10.1016/j.syapm.2010.03.003.
    5. Vandamme P, Pot B, Gillis M, de Vos P. Polyphasic taxonomy, a consensus approach to bacterial systematics. Microbiol Rev. 1996;60(2):407–438.