Bioinformatics. 2022 Sep 15;38(18):4423-4425.
doi: 10.1093/bioinformatics/btac528.

The K-mer File Format: a standardized and compact disk representation of sets of k-mers

Yoann Dufresne et al. Bioinformatics.

Abstract

Summary: Bioinformatics applications increasingly rely on ad hoc disk storage of k-mer sets, e.g. for de Bruijn graphs or alignment indexes. Here, we introduce the K-mer File Format as a general lossless framework for storing and manipulating k-mer sets, realizing space savings of 3-5× compared to other formats, and bringing interoperability across tools.

Availability and implementation: Format specification, C++/Rust API, tools: https://github.com/Kmer-File-Format/.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
Structure of the K-mer File Format with k = 10 and minimizers of size 8. Top right: a toy k-mer set shown in plain text. Left: the same k-mer set represented in KFF. The top-left box is the file header, and each following box is a different section. Bottom right: alternatively, a Sequences section can be represented more succinctly by a Minimizer section that contains the same set of k-mers. For example, the first entry in the M section has sequence ACTG with its minimizer at position 3; hence it corresponds to the sequence ACTAAACTGATG of size 12 (identical to the first entry in the R section), from which three k-mers can be extracted.
