[Preprint]. 2024 Jul 19:2024.07.16.600745.
doi: 10.1101/2024.07.16.600745.

TRGT-denovo: accurate detection of de novo tandem repeat mutations

T Mokveld et al. bioRxiv.

Abstract

Motivation: Identifying de novo tandem repeat (TR) mutations on a genome-wide scale is essential for understanding genetic variability and its implications in rare diseases. While PacBio HiFi sequencing data enhances the accessibility of the genome's TR regions for genotyping, simple de novo calling strategies often generate an excess of likely false positives, which can obscure true positive findings, particularly as the number of surveyed genomic regions increases.

Results: We developed TRGT-denovo, a computational method designed to accurately identify all types of de novo TR mutations (expansions, contractions, and compositional changes) within family trios. TRGT-denovo directly interrogates read evidence, allowing for the detection of subtle variations often overlooked in variant call format (VCF) files. TRGT-denovo improves the precision and specificity of de novo mutation (DNM) identification, reducing the number of de novo candidates by an order of magnitude compared to genotype-based approaches. In our experiments involving eight previously studied rare disease trios, TRGT-denovo correctly reclassified all false positive DNM candidates as true negatives. Using an expanded repeat catalog, it identified new candidates, of which 95% (19/20) were experimentally validated, demonstrating its effectiveness in minimizing likely false positives while maintaining high sensitivity for true discoveries.

Availability and implementation: Built in Rust, TRGT-denovo is available as source code and a pre-compiled Linux binary along with a user guide at: https://github.com/PacificBiosciences/trgt-denovo.

Conflict of interest statement

Competing interests: T. Mokveld, E. Dolzhenko, Z. Kronenberg, and M. A. Eberle are employees and shareholders of Pacific Biosciences.

Figures

Fig 1. Overview of TRGT-denovo (full details in Methods).
(a) TRGT pre-processing, which requires aligned PacBio HiFi reads, a repeat definition catalog, and a reference genome. (b) TRGT-denovo uses TRGT output, specifically spanning reads and genotyping data, along with the reference genome and repeat definitions. (c) By matching repeat definitions and corresponding allele sequences, reads are partitioned and assigned to alleles. This is achieved via TRGT-obtained classifications, consensus allele alignment, or phasing, thus determining the allele sequence each read best supports. (d) Allele-partitioned reads are realigned to child allele consensus sequences for comparison purposes. (e) Potential DNMs are identified by examining discrepancies in alignment score distributions among candidate de novo alleles.
Fig 2. TRGT-denovo metrics.
De novo coverage relative to the (a) allele de novo ratio; (b) child de novo ratio; (c) mean absolute difference between the reads with de novo evidence and PU. Each point represents a potential de novo allele. Horizontal and vertical lines indicate thresholds for minimal de novo coverage, allele de novo ratio, and a range for the child de novo ratio, creating shaded boxes where true de novo mutations are more likely.
Fig 4. Alignment score distributions.
Distributions of alignment scores for reads spanning alleles (M0, M1, F0, F1, C0, C1) when aligned to alleles C0 (a) and C1 (b). WFA alignment scores range from negative (less similar) to zero (perfect match). Inheritance patterns, as inferred from surrounding genetic variation, are: M0C0 (inherited) and F1C1 (inherited + de novo). Symbols PU and CL denote the parental upper bound and the candidate de novo allele lower bound, respectively. Each point corresponds to an individual read aligned to C0 or C1. Red-outlined points highlight two observations: a read from F1 that exceeds CL and overlaps with C1, and a read from C1 that falls below PU and overlaps with F1. In C1 there are 19 reads, of which 18 exceed PU, contributing to the de novo coverage.
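
The quantities named in the captions for Fig 2 and Fig 4 can be illustrated with a minimal Python sketch. It is not the TRGT-denovo implementation (which is written in Rust) and rests on several assumptions that the captions do not confirm: PU is taken as the highest parental alignment score, CL as the lowest score among reads assigned to the candidate child allele, de novo coverage as the number of candidate-allele reads scoring strictly above PU, the allele de novo ratio as that coverage divided by the candidate allele's read count, and the child de novo ratio as that coverage divided by all child reads. The function names, the threshold values, and the toy scores are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class DenovoMetrics:
    parental_upper: float   # PU (assumed): highest parental alignment score
    candidate_lower: float  # CL (assumed): lowest score on the candidate allele
    denovo_coverage: int    # candidate-allele reads scoring strictly above PU
    allele_ratio: float     # de novo coverage / reads on the candidate allele
    child_ratio: float      # de novo coverage / all child reads


def denovo_metrics(candidate: Sequence[float],
                   other_child: Sequence[float],
                   parental: Sequence[float]) -> DenovoMetrics:
    """Summarise alignment scores (<= 0, 0 = perfect match) against one child allele."""
    if not candidate or not parental:
        raise ValueError("need at least one candidate-allele read and one parental read")
    pu = max(parental)
    cl = min(candidate)
    coverage = sum(1 for score in candidate if score > pu)
    child_total = len(candidate) + len(other_child)
    return DenovoMetrics(pu, cl, coverage,
                         coverage / len(candidate),
                         coverage / child_total)


def passes_filters(m: DenovoMetrics,
                   min_coverage: int = 5,                      # hypothetical threshold
                   min_allele_ratio: float = 0.7,              # hypothetical threshold
                   child_ratio_range: Tuple[float, float] = (0.3, 0.7)) -> bool:
    """Keep candidates inside the 'shaded box' formed by the three thresholds."""
    low, high = child_ratio_range
    return (m.denovo_coverage >= min_coverage
            and m.allele_ratio >= min_allele_ratio
            and low <= m.child_ratio <= high)


# Toy scores loosely mirroring Fig 4(b): 19 reads on C1, one of which
# aligns no better than the best parental read.
c1_scores = [0] * 18 + [-40]
c0_scores = [-60] * 20
parental_scores = [-20, -35, -50, -55, -60, -62]

m = denovo_metrics(c1_scores, c0_scores, parental_scores)
print(m.denovo_coverage, round(m.allele_ratio, 3), round(m.child_ratio, 3),
      passes_filters(m))
```

With these toy inputs, 18 of the 19 C1 reads score above PU, matching the count given in the caption; the best parental read (score -20) also exceeds CL (-40), which reproduces the overlap the red-outlined points describe.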
