PeerJ. 2021 Feb 5:9:e10805.
doi: 10.7717/peerj.10805. eCollection 2021.

Syncmers are more sensitive than minimizers for selecting conserved k‑mers in biological sequences

Robert Edgar. PeerJ.

Abstract

Minimizers are widely used to select subsets of fixed-length substrings (k-mers) from biological sequences in applications ranging from read mapping to taxonomy prediction and indexing of large datasets. The minimizer of a string of w consecutive k-mers is the k-mer with smallest value according to an ordering of all k-mers. Syncmers are defined here as a family of alternative methods which select k-mers by inspecting the position of the smallest-valued substring of length s < k within the k-mer. For example, a closed syncmer is selected if its smallest s-mer is at the start or end of the k-mer. At least one closed syncmer must be found in every window of length (k - s) k-mers. Unlike a minimizer, a syncmer is identified by its sequence alone, and is therefore synchronized in the following sense: if a given k-mer is selected from one sequence, it will also be selected from any other sequence. Also, minimizers can be deleted by mutations in flanking sequence, which cannot happen with syncmers. Experiments on minimizers with parameters used in the minimap2 read mapper and Kraken taxonomy prediction algorithm respectively show that syncmers can simultaneously achieve both lower density and higher conservation compared to minimizers.

Keywords: Alignment-free methods; Minimizers; Sequence analysis; String index; k-mers.
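
The two selection rules described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: it assumes plain lexicographic ordering of k-mers and s-mers (the abstract only requires "an ordering"), breaks minimizer ties by taking the leftmost k-mer, and uses hypothetical function names (`kmers`, `minimizers`, `is_closed_syncmer`, `closed_syncmers`) chosen here for readability.

```python
def kmers(seq, k):
    """Return all substrings of length k, in order of start position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def minimizers(seq, k, w):
    """Start positions of k-mers chosen as the minimizer of at least one
    window of w consecutive k-mers (lexicographic order, leftmost on ties)."""
    km = kmers(seq, k)
    selected = set()
    for start in range(len(km) - w + 1):
        window = km[start:start + w]
        # min() over indices returns the leftmost smallest k-mer in the window.
        offset = min(range(w), key=lambda j: window[j])
        selected.add(start + offset)
    return sorted(selected)


def is_closed_syncmer(kmer, s):
    """True if the smallest s-mer of the k-mer sits at its start or its end.
    The decision uses only the k-mer itself, never its flanking sequence."""
    smers = kmers(kmer, s)
    smallest = min(smers)
    return smers[0] == smallest or smers[-1] == smallest


def closed_syncmers(seq, k, s):
    """Start positions of all closed syncmers in seq."""
    return [i for i, km in enumerate(kmers(seq, k)) if is_closed_syncmer(km, s)]
```

For instance, `is_closed_syncmer("GGCAA", 2)` is true because the smallest 2-mer, AA, is the last one, while `is_closed_syncmer("GCAAT", 2)` is false because AA sits in the interior. The minimizer function needs the surrounding window to decide, which is the context dependence the abstract contrasts with syncmers.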

Conflict of interest statement

The author declares that he has no competing interests.

Figures

Figure 1. Closed syncmers.
Construction of k = 5, s = 2 closed syncmers with lexicographic coding. A k-mer is a closed syncmer if its smallest s-mer is at the beginning or end of the k-mer sequence. Consider a window of three k-mers (length L = 2k − s − 1 = 7 letters) with the sequence shown in (A). The smallest s-mer is AA (orange background). (B) Shows the six s-mers in the sequence in (A). Each s-mer is shown with a gray background in the k-mer where it appears in the first or last position. (B) Illustrates that every s-mer in the sequence shown in (A) appears at the start or end of a k-mer. Therefore, regardless of which s-mer has the smallest value, there is a k-mer in the window for which this s-mer appears at the first or last position. In this example, AA appears at the end of GGCAA, marked with an asterisk (*), and GGCAA is therefore a syncmer. This shows that every window of length L must contain at least one syncmer. Note that while flanking sequence is shown in the figure, GGCAA is recognized as a syncmer from its sequence alone because its smallest 2-mer appears at the end. Closed syncmers tend to form pairs spaced at the maximum possible distance (k − s) as illustrated in (C). (D) Illustrates how k = 5, s = 2 closed syncmers are identified in a longer string. The smallest s-mer in each k-mer is shaded with a color. Blue background indicates that the smallest s-mer is not at the start or end; if it does appear at the start or end then it has an orange background and the k-mer is a closed syncmer (indicated by an asterisk).
Figure 2. Frequency distribution of distances between consecutive submers.
The histograms show spacing distributions for some representative submer types with k = 8 including (A–C) minimizers, (D–F) modulo submers, (G–I) closed syncmers, (J–L) closed rotated syncmers, (M–O) open syncmers, (P–R) open rotated syncmers, (S–U) open syncmers with offset and (V–X) downsampled closed syncmers. Parameters, for example, s (sub-sequence length) and t (offset), are shown at the top of each chart, together with c (compression) and Cn (fraction of conserved letters at 90% identity) for every chart.
Figure 3. Conservation as a function of identity.
The histograms show fraction of conserved letters at identities from 80% to 90% for closed syncmers and minimizers comparable to those used in minimap2 (A) and Kraken (B). Notice that the large majority of submers are deleted at these identities.
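
The window guarantee argued in the Figure 1 caption, and the kind of spacing histogram summarized in Figure 2, can be checked empirically with a short sketch. Everything below is an assumption made for illustration rather than the paper's experimental setup: the sequence is uniformly random DNA with a fixed seed, ordering is lexicographic, the parameters are the k = 5, s = 2 of Figure 1, and `density` is simply the fraction of k-mers selected (a working definition used here, not necessarily the paper's compression statistic c).

```python
import random
from collections import Counter


def is_closed_syncmer(kmer, s):
    """True if the smallest s-mer (lexicographic) is first or last in the k-mer."""
    smers = [kmer[i:i + s] for i in range(len(kmer) - s + 1)]
    smallest = min(smers)
    return smers[0] == smallest or smers[-1] == smallest


def syncmer_positions(seq, k, s):
    """Start positions of all closed syncmers in seq."""
    return [i for i in range(len(seq) - k + 1)
            if is_closed_syncmer(seq[i:i + k], s)]


def spacing_histogram(positions):
    """Count the distances between consecutive selected positions."""
    return Counter(b - a for a, b in zip(positions, positions[1:]))


# Illustrative input: random DNA, not one of the paper's datasets.
random.seed(0)
seq = "".join(random.choice("ACGT") for _ in range(10_000))
k, s = 5, 2

pos = syncmer_positions(seq, k, s)
hist = spacing_histogram(pos)
density = len(pos) / (len(seq) - k + 1)  # fraction of k-mers selected

# Every window of (k - s) consecutive k-mers holds at least one closed
# syncmer, so no gap between consecutive syncmers can exceed k - s.
print("largest gap:", max(hist), "bound:", k - s)
print("density:", round(density, 3))
for distance in sorted(hist):
    print(distance, hist[distance])
```

The largest observed gap should never exceed k − s, whatever sequence is supplied; the shape of the histogram and the density, by contrast, depend on the input and on the chosen ordering.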