Bioinformatics. 2023 Jun 1;39(6):btad388.
doi: 10.1093/bioinformatics/btad388.

WarpSTR: determining tandem repeat lengths using raw nanopore signals

Jozef Sitarčík et al. Bioinformatics. 2023.

Abstract

Motivation: Short tandem repeats (STRs) are regions of a genome containing many consecutive copies of the same short motif, possibly with small variations. Analysis of STRs has many clinical uses but is limited by sequencing technology, mainly because STRs exceed the read length used. Nanopore sequencing, one of the long-read sequencing technologies, produces very long reads and thus offers more possibilities to study and analyze STRs. However, basecalling of nanopore reads is particularly unreliable in repeating regions, so direct analysis of raw nanopore data is required.

Results: Here, we present WarpSTR, a novel method for characterizing both simple and complex tandem repeats directly from raw nanopore signals using a finite-state automaton and a search algorithm analogous to dynamic time warping. By applying this approach to determine the lengths of 241 STRs, we demonstrate that our approach decreases the mean absolute error of the STR length estimate compared to basecalling and STRique.
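
To make the idea of a search "analogous to dynamic time warping" more concrete, the following is a minimal sketch of classic dynamic time warping (DTW) used to pick a repeat count. It is not the WarpSTR algorithm: the per-nucleotide signal levels in `TOY_LEVELS`, the helper names, and the brute-force loop over candidate counts are all assumptions made for illustration. Real nanopore signal depends on k-mers rather than single nucleotides, and WarpSTR uses a finite-state automaton instead of enumerating candidates.

```python
# Hypothetical signal level per nucleotide (assumption; real pore models are per k-mer).
TOY_LEVELS = {"A": 0.0, "C": 1.0, "G": 2.0, "T": 3.0}


def expected_signal(left_flank, motif, count, right_flank):
    """Expected level sequence for flank + motif repeated `count` times + flank."""
    sequence = left_flank + motif * count + right_flank
    return [TOY_LEVELS[base] for base in sequence]


def dtw_cost(signal, expected):
    """Classic DTW cost between two 1-D sequences, using absolute difference."""
    n, m = len(signal), len(expected)
    inf = float("inf")
    # cost[i][j] = minimal cost of aligning signal[:i] with expected[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(signal[i - 1] - expected[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # signal sample repeats the same expected level
                cost[i][j - 1],      # expected level skipped forward
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def best_repeat_count(signal, left_flank, motif, right_flank, candidates):
    """Return the candidate repeat count whose expected signal has the lowest DTW cost."""
    return min(
        candidates,
        key=lambda c: dtw_cost(signal, expected_signal(left_flank, motif, c, right_flank)),
    )


# Toy observed signal: three CAG repeats between T flanks, two samples per base.
observed = [lvl for lvl in expected_signal("T", "CAG", 3, "T") for _ in range(2)]
estimate = best_repeat_count(observed, "T", "CAG", "T", range(1, 6))
```

In this noise-free toy case only the three-repeat candidate aligns with zero cost, so `estimate` is 3. With real, noisy signal the costs would differ by smaller margins, which is one reason a more careful model is needed.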

Availability and implementation: WarpSTR is freely available at https://github.com/fmfi-compbio/warpstr.

Conflict of interest statement

None declared.

Figures

Figure 1. Overview of the WarpSTR analysis.
Figure 2. A finite-state automaton modeling a DM1 STR locus (CAG) and its flanking sequences (shown in gray).
Figure 3. Extended finite-state automaton over the k-mer space.
Figure 4. Part of the warping path produced by the WarpSTR search algorithm using the state automaton for the DM1 locus. States representing k-mers are shown on the left, and part of the dynamic programming matrix is shown on the right. The warping path, i.e. the path through states with the lowest cost, is shown as a sequence of lines with corresponding nucleotides.
Figure 5. Signal polishing effect on the alignment. Example alignment of the signal to the expected signal from the state automaton before (top) and after polishing (bottom). Before polishing, the normalized signal values are much higher than the expected signal values, and some of these differences decrease after polishing. More importantly, a spurious repeat in the highlighted windows disappears after polishing.
Figure 6. MAE for WarpSTR and basecalling for individual loci, colored by repeating pattern length. The 25 loci that have very high MAE in both methods are not shown; these are most likely due to large expansions not captured by VCF callers in the gold standard.
Figure 7. Clustered predictions of DM2 for the NA24385 subject, split per repeat unit.
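
The mean absolute error (MAE) reported in the abstract and in Figure 6 is the average of the absolute differences between estimated and reference STR lengths. A minimal sketch follows; the function name and the example values are hypothetical and are not taken from the paper's data.

```python
def mean_absolute_error(predicted, truth):
    """Average absolute difference between paired length estimates and reference lengths."""
    if len(predicted) != len(truth) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(predicted, truth)) / len(predicted)


# Made-up lengths for three loci, for illustration only.
example_mae = mean_absolute_error([10, 12, 15], [10, 13, 13])  # (0 + 1 + 2) / 3
```

A lower value means the length estimates are, on average, closer to the reference lengths.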
