Review
Comput Struct Biotechnol J. 2024 May 21:23:2289-2303.
doi: 10.1016/j.csbj.2024.05.025. eCollection 2024 Dec.

A survey of k-mer methods and applications in bioinformatics

Camille Moeckel et al. Comput Struct Biotechnol J.

Abstract

The rapid progression of genomics and proteomics has been driven by the advent of advanced sequencing technologies; large, diverse, and readily available omics datasets; and the evolution of computational data processing capabilities. The vast amount of data generated by these advancements necessitates efficient algorithms to extract meaningful information. K-mers serve as a valuable tool when working with large sequencing datasets, offering several advantages in computational speed and memory efficiency and carrying the potential for intrinsic biological functionality. This review provides an overview of the methods, applications, and significance of k-mers in genomic and proteomic data analyses, as well as the utility of absent sequences, including nullomers and nullpeptides, in disease detection, vaccine development, therapeutics, and forensic science. The review highlights the pivotal role of k-mers in addressing current genomic and proteomic problems and underscores their potential for future breakthroughs in research.

Keywords: K-mers; Neomers; Nullomers; Nullpeptides; Primes; Sequence Analysis.

Conflict of interest statement

All authors declare that they have no conflicts of interest.

Figures

Fig. 1
Introduction to k-mers. A. All possible 2-mers, or k-mers with two nucleotides, are listed. In a specific DNA sequence, all 2-mers are recorded for frequency analysis. B. Nullomers, or possible 2-mers not in the genome, are counted by subtracting the observed 2-mers from all possible 2-mers. Nullpeptides are k-mers missing from proteomes. C. In a mutated sequence, neomers, or nullomers that resurface due to somatic mutations, can occur. AA is a neomer in this mutated sequence. D. When analyzing multiple genomes or sequences, primes, k-mers not present in any of the sequences, can be identified. There is one prime (CC) in these three sequences. Quasi-primes, or k-mers that only occur in one sequence (AA), can be identified.
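The definitions in this caption can be sketched in a few lines of Python. This is a minimal illustration, not code from the review: the function names, the DNA alphabet constant, and the example sequences are all assumptions chosen for demonstration, and the neomer check simply compares a reference sequence with a mutated copy.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumption: plain DNA, no ambiguity codes


def count_kmers(seq, k):
    """Count every overlapping k-mer in seq (panel A)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def all_kmers(k):
    """Every possible k-mer over the alphabet (4**k strings for DNA)."""
    return {"".join(p) for p in product(ALPHABET, repeat=k)}


def nullomers(seq, k):
    """Possible k-mers that never occur in seq (panel B)."""
    return all_kmers(k) - set(count_kmers(seq, k))


def neomers(reference, mutated, k):
    """Nullomers of the reference that appear in the mutated sequence (panel C)."""
    return nullomers(reference, k) & set(count_kmers(mutated, k))


def primes(seqs, k):
    """k-mers absent from every sequence in the collection (panel D)."""
    observed = set()
    for s in seqs:
        observed |= set(count_kmers(s, k))
    return all_kmers(k) - observed


def quasi_primes(seqs, k):
    """For each sequence, the k-mers that occur in that sequence and no other."""
    presence = Counter()
    for s in seqs:
        presence.update(set(count_kmers(s, k)))  # count sequences, not occurrences
    return [{km for km in count_kmers(s, k) if presence[km] == 1} for s in seqs]
```

For example, `neomers("ACGTTGCA", "ACGTTGAA", 2)` reports the 2-mers that are missing from the first string but present in the second. Enumerating all 4^k k-mers is only practical for small k; larger k would call for a different representation.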
Fig. 2
Applications of k-mers. A. K-mer counting and frequency analysis are crucial steps in various bioinformatic applications, including detecting sample contamination. B. K-mers are used in graph-based genome assembly and identification of genetic variants. C. In sequence assembly, k-mers are utilized for sequence alignment. Sequencing reads are fragmented into k-mers, and overlaps between k-mers are identified to reconstruct the original sequence. D. In adaptive sequencing, k-mer counting allows for the identification of unique sequences. K-mer-based variant-filtering methods are used to improve accuracy in genome assembly; the algorithms filter false positives from alignments. E. K-mers are utilized in genome editing to identify suitable target sites and design guide RNAs, and efficient k-mer indexing enables primer candidates to be identified with low off-target site potential. F. K-mers are used for taxonomic profiling and classification in comparative genomics and metagenomics.
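Panels B and C describe fragmenting reads into k-mers and using their overlaps to reconstruct a sequence. One common way to do this is a de Bruijn graph, sketched below in Python. The caption does not name a specific algorithm, so the de Bruijn construction, the function names, and the toy reads are assumptions, and the walk handles only the simplest case of a single unbranched path with error-free reads.

```python
from collections import defaultdict


def kmers(read, k):
    """Split a read into its overlapping k-mers."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn(reads, k):
    """Build a de Bruijn graph: each distinct k-mer becomes an edge
    from its (k-1)-mer prefix to its (k-1)-mer suffix."""
    graph = defaultdict(set)
    for read in reads:
        for km in kmers(read, k):
            graph[km[:-1]].add(km[1:])
    return graph


def walk_unambiguous(graph):
    """Spell out a contig by following the path from the single start node
    for as long as each node has exactly one successor."""
    targets = {t for successors in graph.values() for t in successors}
    starts = [node for node in graph if node not in targets]
    if len(starts) != 1:
        raise ValueError("expected exactly one start node")
    node = starts[0]
    contig = node
    visited = {node}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in visited:  # stop rather than loop forever on a cycle
            break
        contig += nxt[-1]   # each step extends the contig by one base
        visited.add(nxt)
        node = nxt
    return contig


# Toy reads (hypothetical) that overlap along one underlying sequence.
reads = ["ACGTT", "GTTGC", "TGCAA"]
contig = walk_unambiguous(de_bruijn(reads, 4))
```

Real assemblers must also deal with sequencing errors, repeats that create branches, and both DNA strands, none of which this sketch attempts.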
Fig. 3
Applications of absent sequences. A. The nullomer profile of individuals may be used to characterize populations. B. Nullomers have been detected in cell-free DNA and utilized for cancer detection. C. Immunogenicity is associated with nullomers and nullpeptides, and both have been used in vaccines to increase efficacy and decrease autoimmune reactions. D. Certain nullpeptides exhibit cytotoxic effects and have cancer-killing properties. E. Nullomers and primes can be used for barcoding purposes to label or differentiate samples, as they are absent from one or more organisms. F. Quasi-primes can serve as species-specific biomarkers. Their use as universal fingerprints has potential for real-time organismal detection, understanding evolution as well as for biosecurity.
