Microb Genom. 2023 Jun;9(6):mgen001034.
doi: 10.1099/mgen.0.001034.

PanGraph: scalable bacterial pan-genome graph construction

Nicholas Noll et al. Microb Genom. 2023 Jun.

Abstract

The genomic diversity of microbes is commonly parameterized as SNPs relative to a reference genome of a well-characterized, but arbitrary, isolate. However, any reference genome contains only a fraction of the microbial pangenome, the total set of genes observed in a given species. Reference-based approaches are thus blind to the dynamics of the accessory genome, as well as variation within gene order and copy number. With the widespread usage of long-read sequencing, the number of high-quality, complete genome assemblies has increased dramatically. In addition to pangenomic approaches that focus on the variation in the sets of genes present in different genomes, complete assemblies allow investigations of the evolution of genome structure and gene order. This latter problem, however, is computationally demanding with few tools available that shed light on these dynamics. Here, we present PanGraph, a Julia-based library and command line interface for aligning whole genomes into a graph. Each genome is represented as a path along vertices, which in turn encapsulate homologous multiple sequence alignments. The resultant data structure succinctly summarizes population-level nucleotide and structural polymorphisms and can be exported into several common formats for either downstream analysis or immediate visualization.
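The data structure described above can be illustrated with a minimal Python sketch. This is not PanGraph's Julia implementation or its file format: the class names `Pancontig` and `Path`, their fields, and the helper `core_blocks` are hypothetical, and treating a block as core when it occurs exactly once in every genome is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class Pancontig:
    """A graph vertex: one block of homologous sequence (hypothetical layout)."""
    block_id: str
    consensus: str  # stand-in for the underlying multiple sequence alignment


@dataclass
class Path:
    """One genome, stored as an ordered walk over (block_id, strand) steps."""
    genome: str
    steps: list  # list of (block_id, strand) tuples, strand is "+" or "-"
    circular: bool = True


def core_blocks(paths):
    """Return ids of blocks that occur exactly once in every path (assumed definition)."""
    core = None
    for path in paths:
        counts = {}
        for block_id, _strand in path.steps:
            counts[block_id] = counts.get(block_id, 0) + 1
        single_copy = {b for b, c in counts.items() if c == 1}
        core = single_copy if core is None else core & single_copy
    return core or set()


# Two toy genomes: block B is missing from g2, and block C is duplicated in g2.
paths = [
    Path("g1", [("A", "+"), ("B", "+"), ("C", "+")]),
    Path("g2", [("A", "+"), ("C", "-"), ("C", "+")]),
]
core = core_blocks(paths)  # only block A is single-copy in both genomes
```

Storing each genome as a walk over shared blocks is what lets gene order, strand flips and copy number be read directly from the paths, while nucleotide variation stays inside each block.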

Keywords: graphs; microbial diversity; pangenome.

Conflict of interest statement

The authors declare that they are not aware of relevant conflicts of interests.

Figures

Fig. 1.
Overview of the PanGraph algorithm. (a) The alignment graph is constructed progressively by aligning graphs pairwise up a guide tree, which is built by neighbour-joining on the minimizer overlap between strains. (b) During pairwise alignment, pancontigs (blue and green) are merged by identifying homologous intervals (shown in yellow). If the underlying alignments are deemed compatible, i.e. the energy is less than 0, the pancontigs are merged.
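The guide-tree step in panel (a) relies on a pairwise distance derived from minimizer overlap. The sketch below shows one way such a distance could be computed. It is an illustration only: the function names, the window parameters `k` and `w`, the lexicographic ordering of k-mers and the Jaccard form of the distance are all assumptions, not PanGraph's actual choices. The resulting distance matrix would then be passed to a neighbour-joining routine, which is not shown.

```python
def minimizers(seq, k=5, w=4):
    """Return the set of (w, k)-minimizers of seq.

    For every window of w consecutive k-mers, keep the lexicographically
    smallest k-mer (real tools usually order k-mers by a hash instead).
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return {min(kmers[i:i + w]) for i in range(len(kmers) - w + 1)}


def sketch_distance(a, b, k=5, w=4):
    """Jaccard distance between the minimizer sets of two sequences."""
    ma, mb = minimizers(a, k, w), minimizers(b, k, w)
    union = ma | mb
    if not union:
        return 1.0  # sequences too short to sketch: treat as maximally distant
    return 1.0 - len(ma & mb) / len(union)


# Toy sequences standing in for whole genomes.
same = sketch_distance("ACGTACGTACGT", "ACGTACGTACGT")
different = sketch_distance("ACGTACGTACGT", "TTTTTTTTTTTT")
```

Identical sequences share every minimizer and get distance 0, while sequences with no minimizer in common get distance 1.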
Fig. 2.
Algorithm performance. PanGraph scales linearly with the number of input genomes. This is a direct result of the guide tree simplification. The solid line and ribbons display the mean and standard deviation over 50 runs. All runs were performed utilizing eight cores, and with the default minimap2 alignment kernel and asm20 option.
Fig. 3.
Accuracy against synthetic data. We generated artificial data with varying degree of sequence divergence, and compared the real underlying pangenome graph with the one reconstructed by PanGraph, for three different alignment kernels: minimap2 with asm10 or asm20 option, and mmseqs2. In each comparison we evaluated the misplacement of breakpoints that we can pair on the two graphs within 1 kb. The plot displays the fraction of breakpoints that have misplacement greater than the standard PanGraph precision threshold of L_min = 100 bp, as a function of average pairwise sequence divergence. Line and shaded area represent mean and standard deviation over 25 repetitions. mmseqs2 maintains accuracy at higher divergence, at the cost of higher computational time.
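The accuracy metric in this caption can be sketched as follows. The function name and signature are hypothetical, and two simplifications are assumed: each true breakpoint is paired with its nearest reconstructed breakpoint, and coordinates are treated as linear, so circular genomes are not handled. The authors' exact pairing procedure may differ.

```python
import bisect


def misplaced_fraction(true_bps, recon_bps, pair_window=1000, threshold=100):
    """Fraction of paired breakpoints misplaced by more than `threshold` bp.

    A true breakpoint is paired with the nearest reconstructed breakpoint
    if that one lies within `pair_window` bp; unpaired breakpoints are skipped.
    Returns None when no breakpoint can be paired.
    """
    recon = sorted(recon_bps)
    distances = []
    for bp in true_bps:
        i = bisect.bisect_left(recon, bp)
        # The nearest neighbour is either just left or just right of position i.
        candidates = recon[max(i - 1, 0):i + 1]
        if not candidates:
            continue
        d = min(abs(bp - c) for c in candidates)
        if d <= pair_window:
            distances.append(d)
    if not distances:
        return None
    return sum(d > threshold for d in distances) / len(distances)


# Breakpoints at 1000 and 5000 can be paired (offsets 10 and 300 bp);
# the one at 9000 has no partner within 1 kb and is ignored.
fraction = misplaced_fraction([1000, 5000, 9000], [1010, 5300, 20000])  # 0.5
```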
Fig. 4.
Benchmark on real data. We built pangenome graphs from fully assembled chromosomes from five different bacterial species. For each species we built graphs with three different alignment kernel options (minimap2 with asm10 or asm20 options and mmseqs2) and two different settings for the pseudo-energy parameters α and β (standard or null values). (a) PanGraph wall-time when run in parallel on eight cores. (b) Fraction of core pangenome in the pangenome graph. (c) Sequence compression, defined as the ratio between the pangenome graph size and the cumulative size of all the sequences contained in the graph. Since maximal compression depends on the number of isolates n in the pangenome graph, we mark for reference the value of 1/n for each dataset.
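The two ratios in panels (b) and (c) are simple to compute once pancontig lengths are known. The sketch below uses hypothetical function names and toy numbers. Panel (b) is interpreted here as core length divided by total pangenome length, which is an assumption. The 1/n reference follows from the limiting case of n identical genomes, where the graph holds one genome's worth of sequence and the cumulative size is n times that.

```python
def sequence_compression(pancontig_lengths, genome_lengths):
    """Pangenome graph size divided by the cumulative size of all genomes."""
    return sum(pancontig_lengths) / sum(genome_lengths)


def core_fraction(pancontig_lengths, is_core):
    """Share of the pangenome length that lies in core pancontigs (assumed reading)."""
    total = sum(pancontig_lengths)
    core = sum(length for length, flag in zip(pancontig_lengths, is_core) if flag)
    return core / total


# Toy dataset: three pancontigs, two genomes (n = 2).
pancontig_lengths = [4000, 1000, 500]
is_core = [True, False, False]
genome_lengths = [5000, 4500]

compression = sequence_compression(pancontig_lengths, genome_lengths)  # about 0.58
best_possible = 1 / len(genome_lengths)                                # 0.5
core_share = core_fraction(pancontig_lengths, is_core)                 # about 0.73
```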
Fig. 5.
Pangenome graph properties vs. dataset size. We built pangenome graphs with an increasing number of isolates from the E. coli dataset and measured the scaling of different properties of the graphs. Graphs were built using the minimap2 alignment kernel with asm20 option. Lines and shaded areas represent the mean and standard deviation over 10 different repetitions on random subsets of isolates, except for the final point indicating the full graph (307 isolates). (a) Number of pancontigs in the graph. We count the total number of pancontigs (blue), the number of core pancontigs (orange) and the minimum number of pancontigs that contain more than 50 % of the pangenome (L50, black). (b) Average size of pancontigs (blue), of only core pancontigs (orange), and size of the smallest pancontig in the minimal set that spans 50 % of the pangenome (N50, black). (c) Cumulative size of all genomes in the pangenome graph (grey), total pangenome size (blue) and size of the core pangenome (orange).
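The L50 and N50 statistics used in panels (a) and (b) can be computed from the list of pancontig lengths. The sketch below is a generic implementation with a hypothetical function name. It counts a pancontig set as sufficient once it reaches at least half of the total length, a tie-breaking convention that is assumed here.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of pancontig lengths.

    L50 is the smallest number of pancontigs, taken from longest to shortest,
    whose combined length reaches half of the total; N50 is the length of the
    shortest pancontig in that set.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("lengths must be non-empty")


# Total length is 30, so half is 15: the two longest pancontigs (10 + 8) suffice.
n50, l50 = n50_l50([10, 8, 6, 4, 2])  # n50 == 8, l50 == 2
```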
Fig. 6.
Test of graph marginalization. (a) We built a pangenome graph from 50 randomly chosen strains from the K. pneumoniae dataset. We then randomly picked 50 pairs of strains. For each pair we compared the pangenome graph obtained by marginalizing the complete graph on the pair of strains, and the one obtained by building a new graph for the pair (top). The comparison is done by considering that each graph partitions a genome in shared and private segments. By combining the partitions generated by the marginalized and pairwise graphs we categorize segments in three categories, depending on whether the two partitions agree or not, and if they agree depending on whether segments are shared or private. All graphs were built using the minimap2 alignment kernel with asm20 option and default value for the energy parameters. (b) Distribution of the average fraction of the genome covered by segments of each category, over the 50 pairs considered (two entries per pair). The last line represents the distribution of shared sequence, approximated using the fraction of shared k-mers corrected using sequence divergence as described in the main text. Next to each distribution we report its mean and standard deviation. (c) Distribution of average segment lengths for each category over the 50 pairs considered (two entries per pair). Mean and standard deviation are reported.
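The partition comparison in panel (a) can be sketched as an interval sweep. The representation below is an assumption: each partition is a sorted list of half-open `(start, end, label)` intervals that tile the same genome, with labels `"shared"` and `"private"`, and coordinates are treated as linear. The function name and category keys are hypothetical.

```python
def compare_partitions(marginal, pairwise):
    """Fraction of the genome in each agreement category.

    Both inputs tile the same non-empty genome with (start, end, label)
    intervals. Regions where the labels match count as agree_shared or
    agree_private; all other regions count as disagree.
    """
    totals = {"agree_shared": 0, "agree_private": 0, "disagree": 0}
    i = j = 0
    while i < len(marginal) and j < len(pairwise):
        s1, e1, l1 = marginal[i]
        s2, e2, l2 = pairwise[j]
        overlap = min(e1, e2) - max(s1, s2)
        if overlap > 0:
            key = f"agree_{l1}" if l1 == l2 else "disagree"
            totals[key] += overlap
        # Advance whichever interval ends first (both if they end together).
        if e1 <= e2:
            i += 1
        if e2 <= e1:
            j += 1
    length = sum(totals.values())
    return {key: value / length for key, value in totals.items()}


# The two graphs disagree only on where the shared segment ends (600 vs 500).
marginal = [(0, 600, "shared"), (600, 1000, "private")]
pairwise = [(0, 500, "shared"), (500, 1000, "private")]
fractions = compare_partitions(marginal, pairwise)
# {'agree_shared': 0.5, 'agree_private': 0.4, 'disagree': 0.1}
```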
