Bioinform Adv. 2024 Oct 9;4(1):vbae149.
doi: 10.1093/bioadv/vbae149. eCollection 2024.

ULTRA-effective labeling of tandem repeats in genomic sequence

Daniel R Olson et al. Bioinform Adv. 2024.

Abstract

In the age of long-read sequencing, genomics researchers now have access to accurate repetitive DNA sequence (including satellites) that, due to the limitations of short-read sequencing, could previously be observed only as unmappable fragments. Tools that annotate repetitive sequence are now more important than ever, both so that we can better understand newly uncovered repetitive sequences and so that we can mitigate errors in bioinformatic software caused by those repetitive sequences. To that end, we introduce the 1.0 release of our tool for identifying and annotating locally repetitive sequence, ULTRA Locates Tandemly Repetitive Areas (ULTRA). ULTRA is fast enough to use as part of an efficient annotation pipeline, produces state-of-the-art reliable coverage of repetitive regions containing many mutations, and provides interpretable statistics and labels for repetitive regions.

Availability and implementation: ULTRA is released under an open source license, and is available for download at https://github.com/TravisWheelerLab/ULTRA.

Conflict of interest statement

None declared.

Figures

Figure 1.
ULTRA’s HMM. The cloud shaped nodes represent a collection of states modeling both tandem repeats and also insertion/deletion events (see Section 2.2.1). Self-transition edges have not been drawn but do exist for the nonrepetitive and repetitive states. Similar to tantan, ULTRA models large period repeats as being less common than small period repeats through a decay parameter, λ. For a model allowing maximum period k, the probability of transitioning from the start state to a period p repetitive state is $\gamma_p = \lambda^p / \sum_{i=1}^{k} \lambda^i$. All labeled parameters can be adjusted (see the user guide at https://github.com/TravisWheelerLab/ULTRA).
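The start-state transition probabilities described in this caption can be sketched in a few lines of Python. This is only an illustration of the formula, assuming the decay is geometric in the period (λ raised to the power p); the function name `period_start_probs` and the example values are hypothetical and are not taken from ULTRA's code or defaults.

```python
def period_start_probs(decay: float, max_period: int) -> list[float]:
    """Return [gamma_1, ..., gamma_k], where gamma_p = decay**p / sum_i decay**i."""
    if max_period < 1:
        raise ValueError("max_period must be at least 1")
    if decay <= 0:
        raise ValueError("decay must be positive")
    # Unnormalized weight for each period p = 1..k.
    weights = [decay ** p for p in range(1, max_period + 1)]
    total = sum(weights)
    # Normalizing makes the probabilities sum to 1 over all allowed periods.
    return [w / total for w in weights]


# Hypothetical example values: decay 0.5 and maximum period 3.
probs = period_start_probs(0.5, 3)
for period, prob in enumerate(probs, start=1):
    print(f"period {period}: {prob:.4f}")
```

With a decay below 1, each longer period receives a smaller share of the probability mass, which matches the caption's statement that large period repeats are modeled as less common.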
Figure 2.
Top: The collection of states used to model insertions that occur within a p = 5 tandem repeat (a similar collection of states is used to model deletions). Each state’s look-back is shown within parentheses. Bottom: A p = 5 tandem repeat that contains a length 3 insertion. The letters from the insertion are explained by a path through 3 I-states with look-back = (0), followed by a chain of J-states with look-back = 3 + 5 = (8).
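The look-back arithmetic in this caption can be checked on a toy sequence. The sketch below is not ULTRA's HMM; it only builds a period 5 repeat with a length 3 insertion and tests which look-back distance explains the letters after the insertion. The repeat unit, the insertion letters, and the insertion position are all hypothetical choices made for illustration.

```python
PERIOD = 5
UNIT = "ACGTT"       # hypothetical repeat unit
INSERTION = "GGG"    # hypothetical length 3 insertion
INSERT_AT = 12       # hypothetical position, in the middle of a unit

repeat = UNIT * 4
seq = repeat[:INSERT_AT] + INSERTION + repeat[INSERT_AT:]
after = INSERT_AT + len(INSERTION)  # index of the first letter after the insertion


def matches_at_lookback(seq: str, start: int, stop: int, lookback: int) -> bool:
    """True if every letter in seq[start:stop] equals the letter `lookback` positions earlier."""
    return all(seq[i] == seq[i - lookback] for i in range(start, stop))


# Look-back used right after the insertion: insertion length + period = 3 + 5 = 8.
j_lookback = len(INSERTION) + PERIOD

# The first PERIOD letters after the insertion copy the letters 8 positions back.
print(matches_at_lookback(seq, after, after + PERIOD, j_lookback))
# Those same letters are not explained by the ordinary look-back of 5,
# because 5 positions back lands on or next to the inserted letters.
print(matches_at_lookback(seq, after, after + PERIOD, PERIOD))
# Once a full period has passed, the ordinary look-back of 5 works again.
print(matches_at_lookback(seq, after + PERIOD, len(seq), PERIOD))
```

In this toy case the longer look-back is needed for exactly one period's worth of letters, after which the repeat can be read with its usual period again.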
Figure 3.
A period 4 repeat with two subrepeats (“AAAC” and “GGTT”), each containing multiple substitutions. To find the change in pattern, ULTRA slides two adjacent windows along the sequence and creates profiles representing the repetitive content within the windows. The repetitive profiles of two adjacent windows are compared against each other using Jensen-Shannon divergence (JSD). This figure shows the local window profiles (with profile frequencies displayed as bar charts) and the corresponding JSD for three different positions. The first pair of window profiles contains similar repetitive content, resulting in a small JSD; the last profile pair also yields a small JSD. The middle windows contain different repetitive content, resulting in a large JSD that passes ULTRA’s splitting threshold. Both the repetitive region as a whole and the repetitive region’s subrepeats are included in ULTRA’s final annotation.
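The window comparison in this caption can be sketched as follows. This is a simplified illustration, not ULTRA's implementation: it assumes a profile is the letter frequency at each phase of the period, that both windows start at the same phase, and that the per-phase divergences are averaged. The function names, the window width, and the substitution-free toy sequence are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def window_profile(window: str, period: int) -> list[list[float]]:
    """Letter frequencies at each phase (position within the period) of a window."""
    if len(window) < period:
        raise ValueError("window must cover at least one full period")
    profile = []
    for phase in range(period):
        letters = window[phase::period]
        counts = Counter(letters)
        profile.append([counts[c] / len(letters) for c in ALPHABET])
    return profile


def jsd(p: list[float], q: list[float]) -> float:
    """Jensen-Shannon divergence in bits; 0 for identical, 1 for disjoint distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x: list[float], y: list[float]) -> float:
        # Terms with x == 0 contribute nothing; y is the mixture, so y > 0 whenever x > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def profile_jsd(left: str, right: str, period: int) -> float:
    """Mean per-phase JSD between the profiles of two adjacent windows."""
    pairs = zip(window_profile(left, period), window_profile(right, period))
    return sum(jsd(a, b) for a, b in pairs) / period


# Toy sequence: two subrepeats of period 4, without substitutions.
seq = "AAAC" * 6 + "GGTT" * 6
width = 12  # a multiple of the period keeps both windows in the same phase

same_pattern = profile_jsd(seq[0:width], seq[width:2 * width], 4)
at_boundary = profile_jsd(seq[width:2 * width], seq[2 * width:3 * width], 4)
print(same_pattern, at_boundary)
```

Windows drawn from the same subrepeat give a divergence near zero, while windows straddling the boundary between the two subrepeats give a large one, which is the signal a splitting threshold would act on.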
Figure 4.
Coverage and estimated false coverage for ULTRA, tantan, and TRF. The top chart shows coverage when using a maximum repeat period of 10, and the bottom chart shows coverage when using a maximum repeat period of 500. Plain bars indicate default parameters and textured bars indicate grid-search optimized parameters. We also include results using ULTRA --tune (with default settings). The estimated false discovery rate (FDR) is displayed below each bar. Note that in some cases, there is no parameter choice that achieves <10% FDR; in these cases, no bar is presented, and the FDR value is listed as —.
Figure 5.
Annotation score distributions. Using each tool to label 10 GB of 60% AT-rich random sequence, the left and right plots show per-repeat score distributions for ULTRA and TRF, respectively. The middle plot shows the distribution of tantan per-letter probabilities of being part of a repetitive region. The horizontal axis corresponds to score/probability values and the vertical axis corresponds to value frequency. The exponential decay of ULTRA’s score distribution enables reliable P-value estimates.
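To see why an exponentially decaying score distribution makes P-values easy to estimate, consider the generic sketch below. It is not ULTRA's procedure: it assumes null scores above some floor follow an exponential distribution, fits the rate by maximum likelihood, and reads the tail probability from the fitted curve. The function names are hypothetical, and the scores are synthetic values drawn from a seeded random generator, not results from any tool.

```python
import math
import random


def fit_exponential_rate(scores: list[float], floor: float) -> float:
    """Maximum-likelihood rate of an exponential fitted to the scores at or above `floor`."""
    excess = [s - floor for s in scores if s >= floor]
    if not excess or sum(excess) == 0:
        raise ValueError("need scores above the floor to fit a rate")
    return len(excess) / sum(excess)


def tail_p_value(score: float, rate: float, floor: float, tail_fraction: float) -> float:
    """Estimated P(S >= score) under the fitted tail.

    `tail_fraction` is the share of null scores at or above `floor`.
    """
    return min(1.0, tail_fraction * math.exp(-rate * (score - floor)))


# Synthetic null scores (assumption: exponential with rate 0.5), seeded for repeatability.
rng = random.Random(0)
null_scores = [rng.expovariate(0.5) for _ in range(100_000)]

floor = 0.0
rate = fit_exponential_rate(null_scores, floor)
tail_fraction = sum(s >= floor for s in null_scores) / len(null_scores)

for score in (5.0, 10.0, 20.0):
    print(score, tail_p_value(score, rate, floor, tail_fraction))
```

Because the fitted curve extends beyond the largest score seen in the random data, a P-value can be assigned even to scores that never occurred in the null sample, which would not be possible from an empirical histogram alone.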
Figure 6.
Repeat splitting accuracy versus sequence substitution rate.
