Nat Methods. 2022 Jun;19(6):705-710.
doi: 10.1038/s41592-022-01457-8. Epub 2022 Apr 1.

Long-read mapping to repetitive reference sequences using Winnowmap2

Chirag Jain et al. Nat Methods. 2022 Jun.

Abstract

Approximately 5-10% of the human genome remains inaccessible due to the presence of repetitive sequences such as segmental duplications and tandem repeat arrays. We show that existing long-read mappers often yield incorrect alignments and variant calls within long, near-identical repeats, as they remain vulnerable to allelic bias. In the presence of a nonreference allele within a repeat, a read sampled from that region could be mapped to an incorrect repeat copy. To address this limitation, we developed a new long-read mapping method, Winnowmap2, by using minimal confidently alignable substrings. Winnowmap2 computes each read mapping through a collection of confident subalignments. This approach is more tolerant of structural variation and more sensitive to paralog-specific variants within repeats. Our experiments highlight that Winnowmap2 successfully addresses the issue of allelic bias, enabling more accurate downstream variant calls in repetitive sequences.

Figures

Figure 1:
a. Illustration of allelic bias in near-identical genomic repeats. Paralog-specific variants (PSVs), indicated using colored dot and triangle markers, denote variation between two repeat copies in an ancestral human genome. Mutations in the reference sequence are indicated using ‘x’ markers. Long reads can be mapped to an incorrect repeat copy if the best mapping is decided by pairwise sequence alignment score. b. Minimal confidently alignable substring (MCAS) alignments map to the correct loci on the reference. An MCAS is a carefully selected substring of a read. By excluding non-reference alleles, this approach reduces allelic bias. c. A different example illustrates MCAS computation: starting from a particular position in a read, we look for the shortest substring that can be uniquely mapped to the reference. Uniqueness of an alignment is determined using its mapping quality score.
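The search described in panel c can be sketched in a few lines of Python. This is a toy illustration only, not Winnowmap2's implementation: it assumes exact string matching in place of real alignment, and it treats "exactly one match in the reference" as a stand-in for a high mapping quality score. The names `find_occurrences` and `toy_mcas` are hypothetical.

```python
from typing import Optional


def find_occurrences(reference: str, pattern: str) -> list[int]:
    """Return start positions of all exact (possibly overlapping) matches."""
    positions = []
    start = reference.find(pattern)
    while start != -1:
        positions.append(start)
        start = reference.find(pattern, start + 1)
    return positions


def toy_mcas(read: str, reference: str, start: int,
             min_len: int = 1) -> Optional[tuple[str, int]]:
    """Shortest substring read[start:end] that matches the reference exactly once.

    Assumption: one exact match stands in for a confident (high mapping
    quality) alignment. Returns (substring, reference_position) or None.
    """
    for end in range(start + min_len, len(read) + 1):
        candidate = read[start:end]
        hits = find_occurrences(reference, candidate)
        if len(hits) == 1:
            return candidate, hits[0]
        if not hits:
            return None  # a longer substring cannot match either
    return None  # the read stays ambiguous up to its end


# Two near-identical repeat copies that differ only in their last base (a PSV).
reference = "ACGTACGA" + "TT" + "ACGTACGC"
print(toy_mcas("GTACGC", reference, 0))   # substring must extend to the PSV
print(toy_mcas("ACGTACG", reference, 0))  # lies entirely inside the repeat
```

In this toy setting the substring becomes unique only once it reaches the base that distinguishes the two copies, which mirrors the role the caption gives to paralog-specific variants.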
Figure 2:
Visualization of the alignment pileup near the mutated bases of chromosome 8 using the IGV tool [47]. The sky-blue track at the top of each plot shows mapping coverage on a uniform y-axis scale (0-50). The grey line segments show individual primary read alignments. IGV uses purple markers to indicate the presence of indels within read alignments. NGMLR, minimap2 and GraphMap show reduced coverage due to allelic bias, whereas Winnowmap2 shows the expected coverage in this region. The consistent large insertions in the middle of each plot are distinctly visible because of the simulated SV.
Figure 3:
False negative rate (FNR) and false positive rate (FPR) of SV calls obtained using four mapping methods: Winnowmap2, Winnowmap, minimap2 and NGMLR. The top two plots show accuracy statistics over T2T chromosomes 8 and X, whereas the bottom two plots show the statistics within only the most repetitive intervals of these chromosomes. Winnowmap2 alignments enabled the most accurate Sniffles SV calls, with the lowest FNR and FPR scores. Note that the y-axis scales differ between plots.
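For readers unfamiliar with these two metrics, the sketch below shows one common way to compute them from call counts. The definitions are an assumption, since this record does not state the formulas used in the paper: FNR is taken as the fraction of true SVs that were missed, and FPR as the fraction of reported calls that are false (SV benchmarks have no meaningful count of true negatives). The function name `sv_error_rates` and the counts are hypothetical.

```python
def sv_error_rates(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (FNR, FPR) from true-positive, false-positive and false-negative counts.

    Assumed definitions (not confirmed by the source):
      FNR = FN / (TP + FN)   fraction of true SVs that were missed
      FPR = FP / (TP + FP)   fraction of reported calls that are false
    """
    if tp + fn == 0 or tp + fp == 0:
        raise ValueError("rates are undefined without any true SVs or any calls")
    fnr = fn / (tp + fn)
    fpr = fp / (tp + fp)
    return fnr, fpr


# Made-up counts, for illustration only.
fnr, fpr = sv_error_rates(tp=90, fp=5, fn=10)
print(f"FNR = {fnr:.3f}, FPR = {fpr:.3f}")
```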
Figure 4:
Wall-clock time and memory usage of four mapping methods. Each method was executed using 24 threads on an Intel Xeon processor with 24 physical cores. The y-axes of the plots are log-scaled.
Figure 5:
Comparison of Winnowmap2 and minimap2 using the GIAB SV benchmark set defined for the HG002 human sample. The current GIAB benchmark set (v0.6) excludes complex repeats of the human genome. Outside the repeats, Winnowmap2 achieves similar FNR scores and slightly better FPR scores compared to minimap2.
Figure 6:
The left plots show the size distribution of SVs computed by the Winnowmap2-Sniffles pipeline using the HG004 and HG007 samples. Here we used both the GRCh38 and T2T CHM13 human assemblies as references. The right plot shows the positional density of SVs found in the HG004 sample using an ideogram plot [48] of the T2T CHM13 human assembly (v1.0). Significant enrichment of structural variation occurs in unique and newly resolved repetitive portions of the assembly.

References

    1. Miga KH, Koren S, Rhie A, Vollger MR, Gershman A, Bzikadze A, et al.: Telomere-to-telomere assembly of a complete human X chromosome. Nature (2020)
    2. Logsdon GA, Vollger MR, Hsieh P, Mao Y, Liskovykh MA, Koren S, Nurk S, Mercuri L, Dishuck PC, Rhie A, et al.: The structure, function and evolution of a complete human chromosome 8. Nature pp. 1–7 (2021)
    3. Chaisson MJ, Tesler G: Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics 13(1), 238 (2012)
    4. Sović I, Šikić M, Wilm A, Fenlon SN, Chen S, Nagarajan N: Fast and sensitive mapping of nanopore sequencing reads with GraphMap. Nature Communications 7(1), 1–11 (2016)
    5. Lin HN, Hsu WL: Kart: a divide-and-conquer algorithm for NGS read alignment. Bioinformatics 33(15), 2281–2287 (2017)
