Mol Biol Evol. 2017 May 1;34(5):1167-1182.
doi: 10.1093/molbev/msx066.

Efficient Inference of Recent and Ancestral Recombination within Bacterial Populations

Rafal Mostowy et al. Mol Biol Evol. 2017.

Abstract

Prokaryotic evolution is affected by horizontal transfer of genetic material through recombination. Inference of an evolutionary tree of bacteria thus relies on accurate identification of the population genetic structure and recombination-derived mosaicism. Rapidly growing databases pose a challenge for computational methods that detect recombinations in bacterial genomes. We introduce a novel algorithm called fastGEAR, which identifies lineages in diverse microbial alignments, as well as recombinations between those lineages and from external origins. The algorithm detects both recent recombinations (affecting a few isolates) and ancestral recombinations between detected lineages (affecting entire lineages), thus providing insight into recombinations affecting deep branches of the phylogenetic tree. In simulations, fastGEAR had comparable power to detect recent recombinations and outstanding power to detect ancestral ones, compared with state-of-the-art methods, often at a fraction of the computational cost. We demonstrate the utility of the method by analyzing a collection of 616 whole genomes of the recombinogenic pathogen Streptococcus pneumoniae, for which the method provided a high-resolution view of recombination across the genome. We examined in detail the penicillin-binding genes across the Streptococcus genus, demonstrating previously undetected genetic exchanges between different species at these three loci. Hence, fastGEAR can be readily applied to investigate mosaicism in bacterial genes across multiple species. Finally, fastGEAR correctly identified many known recombination hotspots and pointed to potential new ones. Matlab code and Linux/Windows executables are available at https://users.ics.aalto.fi/~pemartti/fastGEAR/ (last accessed February 6, 2017).

Keywords: Streptococcus pneumoniae; antibiotic resistance; bacterial population genetics; hidden Markov models; population structure; recombination detection.

Figures

Fig. 1
Simulations of bacterial recombinations. The diagram shows the underlying simulation method for the case of P = 2 populations, blue and red. Populations were simulated under a clonal model of evolution for a given set of parameters (see Methods section). Three types of recombinations were then simulated using the clonal alignment. Ancestral recombinations (case 1) occurred before the most recent common ancestor of both populations, and thus were present in all isolates of the recipient lineage. Intermediate recombinations (case 2) occurred sometime between the time when populations emerged and the present time (t = 0), and thus were typically present in multiple isolates. Recent recombinations (case 3) occurred in the last few generations, and thus were typically present in few isolates. To clarify, our method identifies events of type 1 as ancestral recombinations, whereas all other recombinations, which affect less than a whole lineage (cases 2 and 3), are inferred as multiple recent recombinations present in multiple individual strains.
Fig. 2
Hidden Markov models to detect recombination. (A) Hidden Markov model used for identifying lineages and inferring ancestral recombinations. Each column represents a polymorphic site in the alignment and rows represent strains. The observed states of the chain are the allele frequencies within each cluster (in the case of identifying lineages) or lineage (in the case of identifying ancestral recombinations). The latent states of the chain represent identity of allele frequencies in the two lineages at the polymorphic sites. (B) Hidden Markov model used for identifying recent recombinations. The observed states are nucleotide values observed in the target strain and the latent states are possible origins of the nucleotides. The possible origins include all observed lineages plus an unknown origin.
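
The model in panel (B) can be illustrated with a small Viterbi decoder. The following is only a minimal sketch under stated assumptions, not the fastGEAR implementation: the function name `viterbi_origins`, the fixed `stay_prob` transition probability, the uniform `unknown_emit` emission for the unknown origin, and the use of per-site lineage allele frequencies as emission probabilities are all hypothetical simplifications chosen for illustration.

```python
import math


def viterbi_origins(target, lineage_freqs, stay_prob=0.99, unknown_emit=0.25):
    """Return the most likely origin (latent state) at each site of one strain.

    target        -- nucleotides of the target strain at the polymorphic sites
    lineage_freqs -- lineage_freqs[k][t] maps nucleotide -> frequency in
                     lineage k at site t (hypothetical input format)
    The last state index, len(lineage_freqs), stands for the unknown origin.
    """
    if not target:
        return []
    n_states = len(lineage_freqs) + 1
    log_stay = math.log(stay_prob)
    log_switch = math.log((1.0 - stay_prob) / (n_states - 1))

    def log_emit(state, site):
        if state == n_states - 1:          # unknown origin: uniform emission
            return math.log(unknown_emit)
        p = lineage_freqs[state][site].get(target[site], 0.0)
        return math.log(max(p, 1e-6))      # floor avoids log(0)

    # Initialisation with a uniform prior over origins.
    score = [math.log(1.0 / n_states) + log_emit(s, 0) for s in range(n_states)]
    back = []
    for t in range(1, len(target)):
        new_score, pointers = [], []
        for s in range(n_states):
            cands = [score[r] + (log_stay if r == s else log_switch)
                     for r in range(n_states)]
            r_best = max(range(n_states), key=cands.__getitem__)
            new_score.append(cands[r_best] + log_emit(s, t))
            pointers.append(r_best)
        score = new_score
        back.append(pointers)

    # Trace back the best path from the best final state.
    state = max(range(n_states), key=score.__getitem__)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

In this sketch, a run of sites assigned to a lineage other than the strain's own lineage would be read as a candidate recent recombination from that lineage, and a run assigned to the last state as a recombination from an unknown origin.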
Fig. 3
Visual assessment of the inferred population genetic structure. The figure shows the population genetic structure of the simulated data. In each panel, the rows correspond to sequences, columns correspond to positions in the alignment, and colors show different populations. The left column shows the simulated, true structure, whereas the right column shows the population genetic structure inferred by fastGEAR. The order of the sequences in both columns is identical, and the colors are assigned randomly; thus populations are in the same order (1, 2, 3) but can be of different color on the left and on the right. The three figure rows correspond to three different simulation scenarios: only recent recombinations (top), only ancestral recombinations (middle), and all three types of recombinations (bottom). The following parameters were used in the simulations: P = 3, n = 20, Ne = 50, T = 2e4, μ = 2e-6, L = 10 kb (all rows); Γr = 800 and Rr = 5, Ri = Ra = 0 (top panel); Γa = 800 and Ra = 3, Rr = Ri = 0 (middle panel); Γa = 500, Γr = 500 and Ra = 3, Ri = 4, and Rr = 6 (bottom panel).
Fig. 4
Comparison of fastGEAR and other recombination detection methods. The figure shows the performance of fastGEAR compared with other methods: STRUCTURE, Gubbins, and ClonalFrameML. The top row shows results for recent, the middle row for intermediate, and the bottom row for ancestral simulated recombinations. Both recent and intermediate simulated recombinations were detected by fastGEAR in the same way, as “recent” recombinations. The left column shows the false detection rate, namely, the mean number of false-positive recent recombinations per strain (top/middle) and ancestral recombinations per alignment (bottom). The middle column shows the proportion of true recombinations detected, and the right column shows the proportion of the total recombination length detected. The horizontal axis shows the between-population distance per 100 bp (simulated by varying T between 1.0e3 and 2.0e4). Different lines show the performance of different approaches. STRUCTURE was run for 400,000 generations (200,000 burn-in), with true populations set as prior and with three independent chains to test for convergence. ClonalFrameML was conditioned on the true tree topology. The magenta line shows results of fastGEAR run on the full alignment, and the red line shows results of fastGEAR run lineage-by-lineage. Each point represents the average of ten simulations. The following parameters were used in the simulations: P = 3, n = 30, μ = 2.0e-6, L = 20 kb, Σr = 300, Σa = 600 and Rr = 5, Ri = Ra = 5.
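
The three quantities plotted in the columns of this figure can be made concrete with a short calculation for a single strain. This is a hedged sketch only: the caption does not give the exact matching rule, so the function name `detection_metrics`, the half-open `(start, end)` segment format, and the rule that any positive overlap counts as a detection are assumptions made here for illustration. Inferred segments are also assumed not to overlap one another.

```python
def detection_metrics(true_segs, inferred_segs):
    """Compare simulated and inferred recombinant segments for one strain.

    Segments are half-open (start, end) intervals in alignment coordinates.
    Assumption: a true segment counts as detected if any inferred segment
    overlaps it; an inferred segment overlapping no true segment is a
    false positive.
    """
    def overlap(a, b):
        return max(0, min(a[1], b[1]) - max(a[0], b[0]))

    n_detected = sum(
        1 for t in true_segs if any(overlap(t, i) > 0 for i in inferred_segs)
    )
    n_false = sum(
        1 for i in inferred_segs if not any(overlap(t, i) > 0 for t in true_segs)
    )
    total_len = sum(end - start for start, end in true_segs)
    covered = sum(overlap(t, i) for t in true_segs for i in inferred_segs)

    return {
        "false_positives": n_false,
        # Proportions are reported as 0.0 when nothing was simulated.
        "prop_detected": n_detected / len(true_segs) if true_segs else 0.0,
        "prop_length": covered / total_len if total_len else 0.0,
    }
```

For example, with true segments `[(100, 200), (500, 600)]` and inferred segments `[(120, 180), (800, 850)]`, one of two true events is detected, 60 of 200 recombinant positions are covered, and one inferred segment is a false positive. Averaging such per-strain values over strains and simulations would give curves of the kind described in the caption.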
Fig. 5
Results showing inter-species recombination at penicillin-binding proteins. The figure shows fastGEAR results for combined data sets comprising the 616 S. pneumoniae strains and additional sequences from other species (104 in pbp1a, 129 in pbp2b, 127 in pbp2x). The phylogeny and the sequence clusters (SCs) on the left show the core-genome-based tree with 15 major monophyletic clusters for the S. pneumoniae strains. Strains from other species are shown on top of the S. pneumoniae strains. The species annotation is represented by colors on the left side of the additional strains, above the phylogeny. Note that the colors used to annotate species are independent of the colors in the fastGEAR output plots, where colors represent lineages detected by fastGEAR, except white, which denotes gaps in the alignment.
Fig. 6
Comparison of fastGEAR and BratNextGen. The figure presents a detailed comparison of fastGEAR and BratNextGen. The results are shown for the pbp2x gene, zooming in on the sequences from multiple different species shown at the top of figure 5. We see that fastGEAR is able to detect mosaic structure between species.
Fig. 7
Population structure of the pneumococcal data. The phylogeny and the sequence clusters (SCs) on the left show the core-genome-based tree with 15 major monophyletic clusters. The middle panel shows fastGEAR output for 25 out of 96 housekeeping genes, as discussed in the text; results for all 96 genes are qualitatively the same and are shown in supplementary figure S19. The colors represent different lineages identified in the analysis (but are otherwise selected arbitrarily to be easily distinguishable). Recent and ancestral recombinations are colored with the color of the donor lineage. The results for the different genes were obtained by running fastGEAR independently, but the lineage colors at different genes were reordered to reduce the number of colors for any single strain across the genes (see Supplementary text for details). White denotes missing data. The PSA matrix on the right shows the genome-wide proportion of shared ancestry between the isolates in the data set, ranging from blue (distant) to yellow (closely related).
