Bioinformatics. 2023 Oct 3;39(10):btad621.
doi: 10.1093/bioinformatics/btad621.

PanKmer: k-mer-based and reference-free pangenome analysis


Anthony J Aylward et al. Bioinformatics.

Abstract

Summary: Pangenomes are replacing single reference genomes as the definitive representation of DNA sequence within a species or clade. Pangenome analysis predominantly leverages graph-based methods that require computationally intensive multiple genome alignments, do not scale to highly complex eukaryotic genomes, limit their scope to identifying structural variants (SVs), or incur bias by relying on a reference genome. Here, we present PanKmer, a toolkit designed for reference-free analysis of pangenome datasets consisting of dozens to thousands of individual genomes. PanKmer decomposes a set of input genomes into a table of observed k-mers and their presence-absence values in each genome. These are stored in an efficient k-mer index data format that encodes SNPs, INDELs, and SVs. It also includes functions for downstream analysis of the k-mer index, such as calculating sequence similarity statistics between individuals at whole-genome or local scales. For example, k-mers can be "anchored" in any individual genome to quantify sequence variability or conservation at a specific locus. This facilitates workflows with various biological applications, e.g. identifying cases of hybridization between plant species. PanKmer provides researchers with a valuable and convenient means to explore the full scope of genetic variation in a population, without reference bias.
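The decomposition into a k-mer presence-absence table can be illustrated with a minimal sketch in plain Python. This is not PanKmer's implementation or API: the function names, the toy genomes, and the use of Jaccard similarity as the whole-genome statistic are all assumptions made for illustration only.

```python
# Illustrative sketch only; names and toy data are hypothetical, not PanKmer's API.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, kmer.translate(COMPLEMENT)[::-1])


def kmer_set(sequence, k):
    """Collect the canonical k-mers observed in one sequence."""
    return {canonical(sequence[i:i + k]) for i in range(len(sequence) - k + 1)}


def presence_absence_table(genomes, k):
    """Map each observed k-mer to a tuple of 0/1 flags, one flag per genome."""
    names = list(genomes)
    sets = {name: kmer_set(seq, k) for name, seq in genomes.items()}
    all_kmers = sorted(set().union(*sets.values()))
    table = {
        kmer: tuple(int(kmer in sets[name]) for name in names)
        for kmer in all_kmers
    }
    return names, table


def jaccard(table, i, j):
    """Jaccard similarity of genomes i and j (an assumed example statistic)."""
    shared = sum(1 for row in table.values() if row[i] and row[j])
    either = sum(1 for row in table.values() if row[i] or row[j])
    return shared / either if either else 0.0


# Hypothetical toy genomes, far shorter than any real input.
toy_genomes = {"G0": "ACGTAC", "G1": "ACGTTC", "G2": "ACGGAC"}
names, table = presence_absence_table(toy_genomes, k=3)
for kmer, row in table.items():
    print(kmer, row)
print("Jaccard(G0, G1) =", jaccard(table, 0, 1))
```

A dictionary of tuples is used here only for readability; an index meant for thousands of genomes would need a far more compact encoding, which this sketch does not attempt.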

Availability and implementation: PanKmer is implemented as a Python package with components written in Rust, released under a BSD license. The source code is available from the Python Package Index (PyPI) at https://pypi.org/project/pankmer/ as well as GitLab at https://gitlab.com/salk-tm/pankmer. Full documentation is available at https://salk-tm.gitlab.io/pankmer/.


Conflict of interest statement

None declared.

Figures

Figure 1.
PanKmer enables the rapid estimation of relatedness across the pangenome as well as analysis of specific loci. (A) Schematic of procedure for constructing the k-mer index. In the example, each genome G0–G2 is decomposed into canonical 3-mers. Each 3-mer is equivalent to its reverse complement, and the lexicographically first is the canonical form. Each 3-mer is assigned an integer value, and its presence/absence is recorded for each genome in the index. (B) Relatedness heatmap of Typha pangenome, ANI values shown. (C) Genome anchoring plots of representative contigs in TD01. Average k-mer conservation of 100-kb bins shown, where k-mer conservation is the fraction of TD or TL genomes that include each k-mer along the contig.
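The canonical k-mer encoding in panel A and the binned conservation in panel C can be sketched as follows. The caption does not specify how integer values are assigned, so the 2-bit-per-base encoding here is an assumption, as are the function names, the toy sequences, and the choice to compute conservation over a separate set of comparison genomes. It is an illustration of the idea, not PanKmer's own code.

```python
# Illustrative sketch only; the encoding and all names are assumptions.
COMPLEMENT = str.maketrans("ACGT", "TGCA")
BASE_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding per base


def canonical(kmer):
    """Return the lexicographically first of a k-mer and its reverse complement."""
    return min(kmer, kmer.translate(COMPLEMENT)[::-1])


def encode(kmer):
    """Pack a k-mer into an integer, two bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | BASE_BITS[base]
    return value


def canonical_codes(sequence, k):
    """Integer codes of the canonical k-mers, in positional order."""
    return [
        encode(canonical(sequence[i:i + k]))
        for i in range(len(sequence) - k + 1)
    ]


def binned_conservation(anchor, others, k, bin_size):
    """Average, per bin along the anchor, the fraction of other genomes
    that contain each anchor k-mer."""
    if not others:
        raise ValueError("at least one comparison genome is required")
    other_sets = [set(canonical_codes(seq, k)) for seq in others]
    per_position = [
        sum(code in s for s in other_sets) / len(other_sets)
        for code in canonical_codes(anchor, k)
    ]
    bins = []
    for start in range(0, len(per_position), bin_size):
        chunk = per_position[start:start + bin_size]
        bins.append(sum(chunk) / len(chunk))
    return bins


# Hypothetical toy data; the figure uses 100-kb bins, here bins of 2 positions.
print(canonical_codes("ACGT", 3))
print(binned_conservation("ACGTAC", ["ACGTTC", "ACGGAC"], k=3, bin_size=2))
```

In this toy case the first bin is fully conserved and the second is absent from both comparison genomes, which is the kind of contrast the anchoring plots make visible along a contig.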

