[Preprint]. 2024 Jun 4:2024.06.03.597269.
doi: 10.1101/2024.06.03.597269.

ULTRA-Effective Labeling of Repetitive Genomic Sequence

Daniel R Olson et al. bioRxiv. .

Abstract

In the age of long-read sequencing, genomics researchers now have access to accurate repetitive DNA sequence (including satellites) that, due to the limitations of short-read sequencing, could previously be observed only as unmappable fragments. Tools that annotate repetitive sequence are now more important than ever, both so that we can better understand newly uncovered repetitive sequences and so that we can mitigate errors in bioinformatic software caused by those sequences. To that end, we introduce the 1.0 release of our tool for identifying and annotating locally repetitive sequence, ULTRA (ULTRA Locates Tandemly Repetitive Areas). ULTRA is fast enough to use as part of an efficient annotation pipeline, produces state-of-the-art reliable coverage of repetitive regions containing many mutations, and provides interpretable statistics and labels for repetitive regions. It is released under an open license and is available for download at https://github.com/TravisWheelerLab/ULTRA.

Figures

Figure 1.
ULTRA’s HMM. The cloud-shaped nodes represent a collection of states modeling both tandem repeats and insertion/deletion events (see Section Insertion and Deletion States). Self-transition edges have not been drawn but do exist for the non-repetitive and repetitive states. Similar to tantan, ULTRA models large-period repeats as being less common than small-period repeats through a decay parameter, λ. For a model allowing maximum period k, the probability of transitioning from the start state to a period p repetitive state is $\gamma_p = \lambda^p / \sum_{i=1}^{k} \lambda^i$. All labeled parameters can be adjusted (see the user guide at https://github.com/TravisWheelerLab/ULTRA).
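The period weighting in the Figure 1 caption can be illustrated with a minimal Python sketch. It assumes λ is a single scalar raised to the power p and normalized over periods 1 through k. The function name `period_start_probs` and the example values are hypothetical and are not taken from ULTRA's source code.

```python
def period_start_probs(decay: float, max_period: int) -> list[float]:
    """Return [gamma_1, ..., gamma_k], where gamma_p = decay**p / sum_i decay**i.

    `decay` plays the role of lambda and `max_period` the role of k.
    Both names are illustrative, not ULTRA parameter names.
    """
    if decay <= 0:
        raise ValueError("decay must be positive")
    if max_period < 1:
        raise ValueError("max_period must be at least 1")
    # Unnormalized weight for each period p = 1..k.
    weights = [decay ** p for p in range(1, max_period + 1)]
    total = sum(weights)
    # Normalize so the probabilities over all periods sum to 1.
    return [w / total for w in weights]


# Example values chosen only for illustration.
probs = period_start_probs(0.5, 3)
```

With a decay below 1, each longer period receives a smaller share of the probability mass, which matches the caption's statement that large-period repeats are modeled as less common.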
Figure 2.
Top: The collection of states used to model insertions that occur within a p=5 tandem repeat (a similar collection of states is used to model deletions). Each state’s look-back is shown within parentheses. Bottom: A p=5 tandem repeat that contains a length 3 insertion. The letters from the insertion are explained by a path through 3 I-states with look-back = (0), followed by a chain of J-states with look-back = 3 + 5 = (8).
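The look-back arithmetic in the Figure 2 caption can be checked with a small Python sketch. It assumes the chain of J-states covers exactly the p letters that follow the insertion, which is one reading of the caption rather than a confirmed detail of ULTRA's model. The function names and the example sequence are hypothetical.

```python
def look_back_path(period: int, ins_start: int, ins_len: int, seq_len: int) -> list[int]:
    """Look-back used at each position for one insertion inside a tandem repeat.

    0                -> inserted letter (I-state, compared against nothing)
    period + ins_len -> letter just after the insertion (J-state, skips the insertion)
    period           -> ordinary repeat letter
    """
    ins_end = ins_start + ins_len
    path = []
    for i in range(seq_len):
        if ins_start <= i < ins_end:
            path.append(0)
        elif ins_end <= i < ins_end + period:  # assumed J-chain length: period
            path.append(period + ins_len)
        else:
            path.append(period)
    return path


def path_explains(seq: str, path: list[int]) -> bool:
    """True if every letter with a usable look-back matches the letter it points to."""
    return all(
        seq[i] == seq[i - lb]
        for i, lb in enumerate(path)
        if lb > 0 and i - lb >= 0
    )


# A period-5 repeat of "ACGTT" with a length-3 insertion "GGG" at index 10.
seq = "ACGTT" * 2 + "GGG" + "ACGTT" * 2
path = look_back_path(period=5, ins_start=10, ins_len=3, seq_len=len(seq))
```

In this example the three inserted letters get look-back 0, the next five letters get look-back 3 + 5 = 8, and the path then returns to look-back 5.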
Figure 3.
A period 4 repeat with two subrepeats (“AAAC” and “GGTT”), each containing multiple substitutions. To find the change in pattern, ULTRA slides two adjacent windows along the sequence and creates profiles representing the repetitive content within the windows. The repetitive profiles of two adjacent windows are compared against each other using JSD. This figure shows the local window profiles (with profile frequencies displayed as bar charts) and the corresponding JSD for three different positions. The first pair of window profiles contains similar repetitive content resulting in a small JSD; the last profile pair also yields a small JSD. The middle windows contain different repetitive content, resulting in a large JSD that passes ULTRA’s splitting threshold. Both the repetitive region as a whole and also the repetitive region’s subrepeats are included in ULTRA’s final annotation.
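The window comparison in the Figure 3 caption can be sketched in Python as follows. The sketch assumes a profile is a table of per-column letter frequencies (one column per position in the period) and that the per-column Jensen-Shannon divergences are averaged. Both choices are assumptions made for illustration, as are the function names and window size. ULTRA's actual profile construction and splitting threshold may differ.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def jsd(p: list[float], q: list[float]) -> float:
    """Jensen-Shannon divergence (log base 2, so the value lies in [0, 1])."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x: list[float], y: list[float]) -> float:
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def window_profile(window: str, period: int) -> list[list[float]]:
    """Per-column A/C/G/T frequencies; column = position within the period.

    Assumes the window starts at a period boundary and contains only ACGT.
    """
    counts = [Counter() for _ in range(period)]
    for i, letter in enumerate(window):
        counts[i % period][letter] += 1
    profile = []
    for column in counts:
        total = sum(column[a] for a in ALPHABET)
        # Fall back to a uniform column if the window is shorter than the period.
        profile.append([column[a] / total if total else 0.25 for a in ALPHABET])
    return profile


def profile_jsd(left: list[list[float]], right: list[list[float]]) -> float:
    """Average column-wise JSD between two profiles of the same period."""
    return sum(jsd(a, b) for a, b in zip(left, right)) / len(left)


# Two subrepeats, "AAAC" and "GGTT", as in the caption (without substitutions).
seq = "AAAC" * 6 + "GGTT" * 6
w = 12  # illustrative window length
same = profile_jsd(window_profile(seq[0:w], 4), window_profile(seq[w:2 * w], 4))
split = profile_jsd(window_profile(seq[w:2 * w], 4), window_profile(seq[2 * w:3 * w], 4))
```

Here `same` compares two windows inside the "AAAC" subrepeat and `split` compares the windows on either side of the change in pattern, so `split` is the larger of the two values.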
Figure 4.
Coverage and estimated false coverage for ULTRA, tantan, and TRF. The top chart shows coverage when using a maximum repeat period of 10, and the bottom chart shows coverage when using a maximum repeat period of 500. Plain bars indicate default parameters and textured bars indicate grid-search optimized parameters. We also include results using ULTRA --tune (with default settings). The estimated false discovery rate (FDR) is displayed below each bar. Note that in some cases, no parameter choice achieves less than 10% FDR; in these cases, no bar is presented, and the FDR value is listed as —.
Figure 5.
Annotation score distributions. Using each tool to label 10Gb of 60% AT-rich random sequence, the left and right plots show per-repeat score distributions for ULTRA and TRF, respectively. The middle plot shows the distribution of tantan per-letter probabilities of being part of a repetitive region. The horizontal axis corresponds to score/probability values and the vertical axis corresponds to value frequency. The exponential decay of ULTRA's score distribution enables reliable P-value estimates.
Figure 6.
Repeat splitting accuracy vs sequence substitution rate.

