panX: pan-genome analysis and exploration

Wei Ding¹, Franz Baumdicker², Richard A Neher¹,³

Affiliations

¹ Max Planck Institute for Developmental Biology, 72076 Tübingen, Germany.
² Mathematisches Institut, Albert-Ludwigs University of Freiburg, 79104 Freiburg, Germany.
³ Biozentrum and SIB Swiss Institute of Bioinformatics, University of Basel, 4056 Basel, Switzerland.

PMID: 29077859
PMCID: PMC5758898
DOI: 10.1093/nar/gkx977

Wei Ding et al. Nucleic Acids Res. 2018 Jan 9;46(1):e5. doi: 10.1093/nar/gkx977.


Abstract

Horizontal transfer, gene loss, and duplication result in dynamic bacterial genomes shaped by a complex mixture of different modes of evolution. Closely related strains can differ in the presence or absence of many genes, and the total number of distinct genes found in a set of related isolates (the pan-genome) is often many times larger than the genome of individual isolates. We have developed a pipeline that efficiently identifies orthologous gene clusters in the pan-genome. This pipeline is coupled to a powerful yet easy-to-use web-based visualization for interactive exploration of the pan-genome. The visualization consists of connected components that allow rapid filtering and searching of genes and inspection of their evolutionary history. For each gene cluster, panX displays an alignment and a phylogenetic tree, maps mutations within that cluster to the branches of the tree, and infers gain and loss of genes on the core-genome phylogeny. PanX is available at pangenome.de. Custom pan-genomes can be visualized either using a web server or by serving panX locally as a browser-based application.

Figures

**Figure 1.**
panX analysis pipeline. PanX uses DIAMOND (24) and MCL (26,27) to identify clusters of homologous genes from a collection of annotated genomes. These clusters are then analyzed phylogenetically and split into orthologous groups based on the tree structure. The graph on the right shows the time required to identify orthologous clusters in pan-genomes of different sizes on a compute node with 64 cores. The naive all-against-all comparison with DIAMOND scales quadratically with the number of genomes (blue line, “DIAMOND & MCL [all-against-all]”). The “divide and conquer” strategy, in which clustering is first applied to batches of sequences and the batches are subsequently clustered (see text), reduces this scaling to approximately linear (green line). Tree building and post-processing take about as long as the clustering itself for pan-genomes of 500 genomes.
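The batch-then-merge idea in this caption can be illustrated with a minimal Python sketch. It is not the panX implementation: the single-linkage clusterer and the `similar` predicate are toy stand-ins for DIAMOND and MCL, and the function names, the batch size, and the choice of the first member as cluster representative are all assumptions made for illustration.

```python
from itertools import combinations
from typing import Callable, Dict, List, Sequence


def single_linkage(items: Sequence[str],
                   similar: Callable[[str, str], bool]) -> List[List[str]]:
    """Toy all-against-all clustering (stand-in for DIAMOND + MCL).

    Items must be unique strings; any two items for which `similar`
    returns True end up in the same cluster.
    """
    parent = {x: x for x in items}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(items, 2):
        if similar(a, b):
            parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for x in items:
        groups.setdefault(find(x), []).append(x)
    return list(groups.values())


def divide_and_conquer(seqs: Sequence[str], batch_size: int,
                       similar: Callable[[str, str], bool]) -> List[List[str]]:
    """Cluster within batches, then cluster one representative per cluster."""
    rep_to_members: Dict[str, List[str]] = {}
    for start in range(0, len(seqs), batch_size):
        batch = seqs[start:start + batch_size]
        for cluster in single_linkage(batch, similar):
            # Assumption: the first member represents the whole cluster.
            rep_to_members[cluster[0]] = cluster

    merged = []
    for rep_cluster in single_linkage(list(rep_to_members), similar):
        merged.append([m for rep in rep_cluster for m in rep_to_members[rep]])
    return merged


# Hypothetical toy data: sequences count as "similar" if they share a prefix.
toy_seqs = ["AAAT", "AAAC", "GGGT", "AAAG", "GGGA", "CCCT"]
toy_clusters = divide_and_conquer(toy_seqs, batch_size=3,
                                  similar=lambda a, b: a[:3] == b[:3])
```

In this toy version the second pass compares only cluster representatives rather than all sequences, which is the source of the savings relative to a single all-against-all comparison.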

**Figure 2.**
Accuracy of clustering by different tools. The fraction of mis-clustered genes increased with diversity of the pan-genome. We ran Roary with options -i 70 and -i 50. At low diversity, panX and Roary (-i 70) performed with similar accuracy and mis-clustered about 1 in 1000 genes. At high diversities, all tools showed similar accuracy and mis-clustered 1 in 10 genes. Results for tools designed for high diversity data sets (OrthoMCL and OrthoFinder) are only shown for diversities above 0.02. Similarly, results for Roary are suppressed at high diversity to improve clarity of the graph.

**Figure 3.**
Type of mis-clustering by tool and gene frequency. The histograms show the fraction of wrongly merged (red) and wrongly split (blue) clusters by gene frequency and clustering tool across 5 simulated datasets with exponentially distributed substitution rates with mean rate μ = 1/15.

**Figure 4.**
Pan-genome statistics. Panels A and C show the distribution of the number of strains represented in pan-genomes of 33 *S. pneumoniae* and 40 *Prochlorococcus* strains constructed by panX, Roary, OrthoFinder, OrthoMCL and PanOCT (the last two tools are only available for the smaller *Prochlorococcus* data set). To obtain these graphs, clusters are sorted by descending number of strains represented in the cluster. This number is then plotted against the rank of the sorted clusters. The point where the lines drop below the number of strains marks the size of the core genome. PanX, OrthoFinder and OrthoMCL largely agree on the cluster size distribution, the number of core genes and the total size of the pan-genome (with ∼10% variation). Roary agrees with the latter tools if the identity cut-off is chosen appropriately, while PanOCT estimates a very small core genome and an extremely large number of gene clusters. Panels B and D show the degree to which different pan-genome tools agree with each other. Each row shows the fraction of clusters identified by one tool that exactly match the clusters identified by another tool. Analogous results for simulated data are given in Supplementary Figure S6.
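The two statistics described in this caption can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used for the figure: genes are represented as hypothetical `(strain_id, gene_id)` tuples, a cluster counts as core whenever every strain is represented (copy number is ignored), and all function names are invented for this example.

```python
from typing import Iterable, List, Set, Tuple

Gene = Tuple[str, str]  # (strain_id, gene_id); hypothetical representation


def strains_per_cluster_ranked(clusters: Iterable[Set[Gene]]) -> List[int]:
    """Number of distinct strains per cluster, sorted in descending order."""
    counts = [len({strain for strain, _ in cluster}) for cluster in clusters]
    return sorted(counts, reverse=True)


def core_genome_size(ranked_counts: List[int], n_strains: int) -> int:
    """Rank at which the sorted curve drops below the number of strains."""
    return sum(1 for count in ranked_counts if count == n_strains)


def exact_match_fraction(tool_a: Iterable[Set[Gene]],
                         tool_b: Iterable[Set[Gene]]) -> float:
    """Fraction of tool A's clusters that appear identically in tool B."""
    b_clusters = {frozenset(c) for c in tool_b}
    a_clusters = [frozenset(c) for c in tool_a]
    if not a_clusters:
        return 0.0
    return sum(c in b_clusters for c in a_clusters) / len(a_clusters)


# Hypothetical toy clusterings of three strains by two tools.
tool_a = [
    {("s1", "g1"), ("s2", "g1"), ("s3", "g1")},
    {("s1", "g2"), ("s2", "g2")},
    {("s3", "g3")},
]
tool_b = [
    {("s1", "g1"), ("s2", "g1"), ("s3", "g1")},
    {("s1", "g2")},
    {("s2", "g2")},
    {("s3", "g3")},
]
ranked_a = strains_per_cluster_ranked(tool_a)
```

Note that the exact-match fraction is not symmetric: a tool that splits clusters more aggressively produces more clusters, so its row in the comparison can differ from the corresponding column.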

**Figure 5.**
Interconnected components of the panX web application. The top panels provide a statistical characterization of the pan-genome and allow filtering of gene clusters by abundance and gene length. The gene cluster table below is searchable and sortable and allows the user to select individual gene clusters for closer inspection. Upon selection in the table, the alignment of the gene cluster is loaded into the viewer on the center right, the gene tree is loaded into the tree viewer at the bottom right, and presence/absence patterns of this gene cluster are mapped onto the core genome tree at the bottom left. The example shows the gene coding for the penicillin binding protein Pbp2x, and the color indicates the susceptibility to benzylpenicillin.

**Figure 6.**
Linked core genome and gene trees. The core genome tree shows the strains in which the current gene is present or absent. Placing the mouse over an internal node in one of the trees (upper clade of the gene tree on the right in this example) highlights all strains in the corresponding clade in both trees. This gives the user a rapid impression of phylogenetic incongruence and likely gene gain and loss events.

References

1. Soucy S.M., Huang J., Gogarten J.P. Horizontal gene transfer: building the web of life. Nat. Rev. Genet. 2015; 16:472–482.
2. Thomas C.M., Nielsen K.M. Mechanisms of, and barriers to, horizontal gene transfer between bacteria. Nat. Rev. Micro. 2005; 3:711–721.
3. Puigbò P., Lobkovsky A.E., Kristensen D.M., Wolf Y.I., Koonin E.V. Genomes in turmoil: quantification of genome dynamics in prokaryote supergenomes. BMC Biol. 2014; 12:66.
4. Vernikos G., Medini D., Riley D.R., Tettelin H. Ten years of pan-genome analyses. Curr. Opin. Microbiol. 2015; 23:148–154.
5. Lapierre P., Gogarten J.P. Estimating the size of the bacterial pan-genome. Trends Genet. 2009; 25:107–110.
