BMC Genomics. 2024 Jan 26;25(1):115.
doi: 10.1186/s12864-023-09935-9.

LUSTR: a new customizable tool for calling genome-wide germline and somatic short tandem repeat variants

Jinfeng Lu et al. BMC Genomics.

Abstract

Background: Short tandem repeats (STRs) are widely distributed across the human genome and are associated with numerous neurological disorders. However, the extent to which STRs contribute to disease is likely underestimated because of the challenges of calling these variants in short-read next-generation sequencing data. Several computational tools have been developed for STR variant calling, but none fully addresses all of the complexities associated with this variant class.

Results: Here we introduce LUSTR, which is designed to address some of the challenges associated with STR variant calling by enabling more flexibility in defining STR loci, allowing for customizable modules to tailor analyses, and expanding the capability to call somatic and multiallelic STR variants. LUSTR is a user-friendly and easily customizable tool for targeted or unbiased genome-wide STR variant screening that can use either predefined or novel genome builds. Using both simulated and real data sets, we demonstrated that LUSTR accurately infers germline and somatic STR expansions in individuals with and without diseases.

Conclusions: LUSTR offers a powerful and user-friendly approach that allows for the identification of STR variants and can facilitate more comprehensive studies evaluating the role of pathogenic STR variants across human diseases.

Keywords: Bioinformatics; LUSTR; Short tandem repeats; Somatic; Variant calling tool kit.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
LUSTR pipeline and modules. LUSTR distinguishes itself from other existing pipelines or tools in the following aspects: (1) A “finder” module standardizes the extraction of genomic STR regions to be genotyped. The “finder” module aims to simplify the information required to target specific STRs, diminish the impact of imperfect input, and provide flexibility for user-customized target lists, ranging from unbiased compilations for genome-wide scans to a small number of targeted STR sequences. (2) Instead of directly processing mapped reads (.bam files) obtained from alignment pipelines not necessarily optimized for STRs, LUSTR remaps the raw reads (.fastq files) de novo to the STR references defined in the “finder” module, with parameters adjusted specifically for STR calling to enhance performance. We provide an “extractor” module to retrieve reads from the .bam file if raw reads are not available. This remapping step and the “finder” module, indicated by a dashed rectangle, are unique to LUSTR and are not available in other STR calling pipelines. (3) LUSTR implements a flexible two-step strategy for STR genotyping: a local realignment step in the “realigner” module, which incorporates reads that may have been discarded during the mapping process, and a freestanding calling step in the “caller” module, which processes the realignment results to estimate the genotypes for each STR. This modular approach allows for precise tracking of reads through realignment, which is critical for debugging and performance evaluation, and allows easier implementation of necessary updates or incorporation of project-specific optimization. (4) The “realigner” module applies both flanking-guided and repeat-guided realignment to ensure both accuracy and sensitivity. (5) The “caller” module allows fractional multiallelic STR genotyping results amenable to the calling of germline or somatic expansions or contractions. (6) LUSTR minimizes the prerequisites and only requires pre-installations of samtools and bwa.
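To make the idea of defining STR target regions from a reference sequence concrete, the following is a minimal toy sketch that scans a sequence for perfect tandem repeats. It is not LUSTR's “finder” algorithm, which is not described here; the function name `find_str_tracts`, its parameters, and the toy reference are all hypothetical and chosen for illustration only.

```python
import re

def find_str_tracts(seq, min_unit=2, max_unit=6, min_copies=3):
    """Return sorted (start, end, unit, copies) tuples for perfect tandem repeats.

    Hypothetical helper for illustration; coordinates are 0-based, end-exclusive.
    """
    tracts = []
    for unit_len in range(min_unit, max_unit + 1):
        # A unit of unit_len bases followed by at least (min_copies - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            # Skip units that are repeats of a shorter motif (e.g. "ATAT" is "AT" twice).
            if any(unit == unit[:k] * (unit_len // k)
                   for k in range(1, unit_len) if unit_len % k == 0):
                continue
            copies = (match.end() - match.start()) // unit_len
            tracts.append((match.start(), match.end(), unit, copies))
    return sorted(tracts)

# Toy reference: 4 bp of flank, five copies of a hexanucleotide unit, 4 bp of flank.
toy_reference = "ACGT" + "GGGGCC" * 5 + "TTAC"
print(find_str_tracts(toy_reference))
```

A real target list would also need to handle imperfect repeats and genome-scale input, which this sketch deliberately ignores.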
Fig. 2
LUSTR is robust in tests with simulated libraries. To test the performance of LUSTR in size and allele fraction estimations, we generated simulated reads from the C9orf72 locus, including 2 × 1000 bp flanking regions and the repeats of (a) homozygous alleles with different expanded or contracted repeat sizes (ranging from -10.3 to +1000), and (b) heterozygous alleles with one reference allele and one expanded allele (+100 repeats), mixed at different fractions. Reads 150 nucleotides in length were generated in pairs with an error rate of 0.5%, including mismatches, insertions, and deletions, at average coverages ranging from 1X to 100X. Each combination was repeated 10 times as a group. The number of failed libraries in each group, which failed because of low coverage and mainly under the 1X coverage condition, is indicated by red shading. For successfully called libraries, we examined the estimated repeat size variants (a) and the estimated fraction of the reference allele (b). The observed and expected values are shown for each scenario evaluated. We compared the average result in each group (indicated by a black solid line) with the expectation (indicated by a blue dotted line) and calculated the square of the correlation coefficient (r²). Among the sizes evaluated, we specifically tested the repeat size variations for the deletion allele (-10.3), the reference allele (0), and an allele with a repeat size close to the read length (+15) in Fig. 2a. For size estimation (a), LUSTR showed robust performance starting from 5X coverage and came very close to the expectations at coverage as low as 10X. For fraction estimation (b), LUSTR required higher coverage but still gave reliable estimates of the expected allelic ratio at only 10X coverage. This result showed that LUSTR robustly infers both repeat size and allele fraction even for low-coverage libraries.
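The arithmetic behind the simulation design in Fig. 2 can be sketched as follows: how many 150-nucleotide read pairs correspond to a target average coverage over a simulated region, and how a squared correlation between group averages and expectations can be computed. This is a minimal sketch assuming a Pearson correlation and an illustrative region length; the function names and example numbers are hypothetical and are not taken from the paper's simulation code or results.

```python
def read_pairs_for_coverage(region_len, coverage, read_len=150):
    """Number of read pairs giving roughly `coverage` average depth over `region_len` bases."""
    # Each pair contributes 2 * read_len sequenced bases.
    return round(coverage * region_len / (2 * read_len))

def r_squared(observed, expected):
    """Square of the Pearson correlation coefficient between two equal-length series."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_e = sum(expected) / n
    cov = sum((o - mean_o) * (e - mean_e) for o, e in zip(observed, expected))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_e = sum((e - mean_e) ** 2 for e in expected)
    return (cov * cov) / (var_o * var_e)

# Illustrative region: 2 x 1000 bp of flank plus an assumed 600 bp repeat tract.
region_len = 2 * 1000 + 600
print(read_pairs_for_coverage(region_len, coverage=10))

# Made-up group averages versus expected reference-allele fractions (not data from the paper).
expected_fraction = [0.1, 0.3, 0.5, 0.7, 0.9]
observed_fraction = [0.12, 0.29, 0.52, 0.68, 0.91]
print(r_squared(observed_fraction, expected_fraction))
```

Because r² measures only linear association, a systematic offset between observed and expected values would not lower it; comparing the solid and dotted lines in the figure covers that case.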
