Genome Biol. 2022 Dec 14;23(1):257.
doi: 10.1186/s13059-022-02826-4.

STRling: a k-mer counting approach that detects short tandem repeat expansions at known and novel loci

Harriet Dashnow et al. Genome Biol.

Abstract

Expansions of short tandem repeats (STRs) cause many rare diseases. Detecting expansions in short-read DNA sequencing data is challenging because the supporting reads are often mapped incorrectly. Detection is particularly difficult for "novel" STRs, which include new motifs at known loci and STRs absent from the reference genome. We developed STRling, which efficiently counts k-mers to recover informative reads and call expansions at known and novel STR loci. STRling is sensitive to known STR disease loci, has a low false discovery rate, and resolves novel STR expansions to base-pair position accuracy. It is fast, scalable, open-source, and available at github.com/quinlan-lab/STRling.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
STRling uses several types of read evidence to infer STR location and size. (A) STRling performs k-mer counting in reads that are soft-clipped, unaligned, or aligned to a large STR in the reference genome. For each k-mer of length 2–6 bp, STRling selects the one that covers the largest proportion of the read. If two are equal, the smallest is chosen. (B) Where a pair of reads has one read that maps well to the reference genome, and a mate with high STR content, the mapping position of the well-mapped read is used to reposition the STR read. These “anchored pairs” aid in refining the location and improve the quantification of sequence support for the putative STR. (C) Different classes of reads are used to support STR alleles of varying length. Small alleles, shorter than the read length, can be detected by spanning reads, and typically have many spanning pairs. Medium expansions, of a length between the read length and the fragment size, are indicated by anchored pairs and few spanning pairs. Soft-clipped reads can be used to infer the precise insertion point. Large expansions, those longer than the fragment size, are evidenced by a larger number of anchored pairs, as well as contributing unplaced pairs.
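The k-mer selection rule in panel A can be illustrated with a short Python sketch. This is not STRling's own code: the function name `best_repeat_unit`, the choice to count non-overlapping k-mers in every reading frame, and the use of covered bases as the "proportion of the read" are all assumptions made for illustration. Only the 2–6 bp range and the tie-break toward the smallest k-mer come from the caption.

```python
from collections import Counter


def best_repeat_unit(read, min_k=2, max_k=6):
    """Return (kmer, proportion) for the k-mer covering most of the read.

    Hypothetical helper, not STRling's implementation. Coverage is
    approximated as (non-overlapping occurrences in one frame) * k.
    """
    best_kmer, best_covered = None, 0
    # Iterate k from small to large and replace only on a strict
    # improvement, so ties resolve to the smallest k-mer.
    for k in range(min_k, max_k + 1):
        for frame in range(k):
            counts = Counter(
                read[i:i + k] for i in range(frame, len(read) - k + 1, k)
            )
            for kmer, n in counts.items():
                covered = n * k  # bases of the read covered by this k-mer
                if covered > best_covered:
                    best_kmer, best_covered = kmer, covered
    proportion = best_covered / len(read) if read else 0.0
    return best_kmer, proportion


print(best_repeat_unit("CAGCAGCAGCAG"))
```

In this sketch the 6-mer `CAGCAG` covers the read as completely as `CAG` does, and the ascending loop with a strict comparison is what keeps the shorter unit.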
Fig. 2
STRling joint calling workflow. Index: STRling creates an index of the reference genome, recording the genomic coordinates where large STRs are observed. These regions act as STR “sinks”, collecting repetitive reads. Any reads mapping to these regions, in addition to unmapped reads, are candidates to have arisen from a large STR expansion. Extract: STRling counts k-mers to find high STR-content reads, then checks the mate to move the read to its correct position. Merge: read evidence is combined across individuals to increase the accuracy and uniformity of candidate STR expansion loci. Call: STRling estimates the allele sizes using the k-mer count across all reads assigned to a given locus in a linear model. Outlier: STRling checks the distribution across all individuals at a given locus, and tests for outliers.
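The caption does not say which statistical test the Outlier step uses. As a rough illustration of the idea only, the sketch below flags individuals whose allele size estimate at one locus is unusually large relative to the cohort, using a generic median/MAD robust z-score. The function name `flag_outliers`, the threshold of 3, and the example values are assumptions and should not be read as STRling's actual method.

```python
from statistics import median


def flag_outliers(sizes, threshold=3.0):
    """Return sample IDs whose estimate is a high outlier at one locus.

    `sizes` maps sample ID -> estimated allele size. Generic robust
    z-score (median / MAD); an assumption, not STRling's own test.
    """
    values = list(sizes.values())
    if not values:
        return []
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread in the cohort, nothing can be scored
    scale = 1.4826 * mad  # makes MAD comparable to a standard deviation
    # One-sided: only unusually large alleles (expansions) are flagged.
    return [s for s, v in sizes.items() if (v - med) / scale > threshold]


# Made-up allele sizes for seven hypothetical samples at a single locus.
cohort = {"s1": 10, "s2": 11, "s3": 9, "s4": 10, "s5": 12, "s6": 10, "s7": 50}
print(flag_outliers(cohort))
```

A median-based score is used here because a single large expansion would inflate the mean and standard deviation of a small cohort and partly mask itself.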
Fig. 3
STRling shows superior position accuracy at known pathogenic loci. STRling and ExpansionHunter Denovo (EHdn) were run on PCR-free Illumina WGS of 134 subjects with known STR disease status, 94 of which had alleles of pathogenic size (those plotted here). STRling was run on an individual genome (“Individual calling”) or on all 134 genomes together (“Joint calling”). EHdn was run with all affected genomes together in outlier mode (“EHdn affected vs. affected”), or each of the true positives was run in outlier mode with a set of 260 unaffected individuals from 1000 Genomes (“EHdn affected vs. controls”). A locus was considered found if an STR expansion with the pathogenic repeat unit was reported within 500 bp of the true locus. Max position error is the position difference between the known and predicted locus (max of upstream and downstream). Zero indicates the predicted position is within or at the bounds of the known locus. STRling detected the true locus position to base-pair accuracy for most loci. Joint calling was more accurate than individual calling, and STRling was more accurate than ExpansionHunter Denovo under all conditions tested.
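The position metric described in this caption can be written out as a small Python sketch. The function names, the interval representation (start and end coordinates on the same chromosome), and the exact handling of each boundary are assumptions based on the caption's wording, not the authors' evaluation code.

```python
def max_position_error(known_start, known_end, pred_start, pred_end):
    """Largest distance by which the prediction overshoots the known locus.

    Returns 0 when the predicted interval lies within or at the bounds
    of the known locus. Hypothetical helper based on the caption.
    """
    upstream = max(0, known_start - pred_start)   # overshoot before the locus
    downstream = max(0, pred_end - known_end)     # overshoot after the locus
    return max(upstream, downstream)


def locus_found(error_bp, same_repeat_unit, window_bp=500):
    """A locus counts as found if the motif matches and it is within 500 bp."""
    return same_repeat_unit and error_bp <= window_bp


print(max_position_error(100, 200, 90, 210))
```

Whether a prediction exactly 500 bp away counts as found is not stated in the caption; the sketch treats the window as inclusive.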
Fig. 4
Allele size estimates from STRling compared with PCR estimates (log-log scale). STR allele size estimates are shown for 103 individuals who were also assayed with PCR. “Expanded” includes all pathogenic allele sizes, in both affected individuals and carriers. “Normal” indicates non-pathogenic alleles. The black line indicates x = y, equality between STRling and PCR allele size estimates.

