Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2024 Nov 1;40(11):btae629.
doi: 10.1093/bioinformatics/btae629.

k-nonical space: sketching with reverse complements

Affiliations

k-nonical space: sketching with reverse complements

Guillaume Marçais et al. Bioinformatics. .

Abstract

Motivation: Sequences equivalent to their reverse complements (i.e. double-stranded DNA) have no analogue in text analysis and non-biological string algorithms. Despite this striking difference, algorithms designed for computational biology (e.g. sketching algorithms) are designed and tested in the same way as classical string algorithms. Then, as a post-processing step, these algorithms are adapted to work with genomic sequences by folding a k-mer and its reverse complement into a single sequence: The canonical representation (k-nonical space).

Results: The effect of using the canonical representation with sketching methods is understudied and not understood. As a first step, we use context-free sketching methods to illustrate the potentially detrimental effects of using canonical k-mers with string algorithms not designed to accommodate for them. In particular, we show that large stretches of the genome ("sketching deserts") are undersampled or entirely skipped by context-free sketching methods, effectively making these genomic regions invisible to subsequent algorithms using these sketches. We provide empirical data showing these effects and develop a theoretical framework explaining the appearance of sketching deserts. Finally, we propose two schemes to accommodate for these effects: (i) a new procedure that adapts existing sketching methods to k-nonical space and (ii) an optimization procedure to directly design new sketching methods for k-nonical space.

Availability and implementation: The code used in this analysis is available under a permissive license at https://github.com/Kingsford-Group/mdsscope.

PubMed Disclaimer

Figures

Figure 1.
Figure 1.
The left-most tie breaking rule implies a relaxed window guarantee for syncmers when ts/2. In this example, k=9,t=2,s=5, and the alphabetic order on 5-mers is used. The 9-mer is not selected because the 5-mer at position 2 is equal to the 5-mer at position 0 and the left-most tie breaking rule. Because this 5-mer overlaps with itself, it is “almost” repetitive: It is a sesqui-power xny with x=AC,y=A,n=2. Repetitive sequences of length t =2 are skipped over.
Figure 2.
Figure 2.
For k =6 and σ  =  2, examples of PCRs. Every k-mer on the left side is a canonical k-mer (C), and k-mers symmetrical compared to the vertical line are reverse complement of each other. The k-mers inside gray boxes are an example of set φ, and the k-mers in dashed boxes are the corresponding φc set. 000010 is in both φ and φc (case a). Not every PCR is covered by φc, and consequently φc is not decycling (cases b and c).

References

    1. Axler S. Linear Algebra Done Right. Springer Nature, 2015.
    1. Champarnaud J-M, Hansel G, Perrin D.. Unavoidable sets of constant length. Int J Algebra Comput 2004;14:241–51. 10.1142/S0218196704001700 - DOI
    1. DeBlasio D, Gbosibo F, Kingsford C. et al. Practical universal k-mer sets for minimizer schemes. In Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, BCB ’19, pp. 167–76. New York, NY, USA: ACM, 2019.
    1. Dutta A, Pellow D, Shamir R.. Parameterized syncmer schemes improve long-read mapping. PLoS Comput Biol 2022;18:e1010638. 10.1371/journal.pcbi.1010638 - DOI - PMC - PubMed
    1. Edgar R. Syncmers are more sensitive than minimizers for selecting conserved k-mers in biological sequences. PeerJ 2021;9:e10805. 10.7717/peerj.10805 - DOI - PMC - PubMed