Nucleic Acids Res. 2025 Feb 27;53(5):gkaf131.
doi: 10.1093/nar/gkaf131.

A detailed analysis of second and third-generation sequencing approaches for accurate length determination of short tandem repeats and homopolymers


Sophie I Jeanjean et al. Nucleic Acids Res.

Abstract

Microsatellites are short tandem repeats (STRs) of a motif of 1–6 nucleotides that are ubiquitous in almost all genomes and widely used in many biomedical applications. However, despite the development of next-generation sequencing (NGS) over the past two decades, with new technologies coming to the market, accurately sequencing and genotyping STRs, particularly homopolymers, remains very challenging because of several technical limitations. In many cases this leads to erroneous allele calls and makes it difficult to identify the genuine allele distribution in a sample. Here, we assessed several second- and third-generation sequencing approaches for their ability to correctly determine the length of microsatellites, using plasmids containing A/T homopolymers or AC/TG or AT/TA dinucleotide STRs of variable length. Standard polymerase chain reaction (PCR)-free and PCR-containing, single Unique Molecular Identifier (UMI), and dual-UMI 'duplex sequencing' protocols were evaluated using Illumina short-read sequencing, and two PCR-free protocols were evaluated using PacBio and Oxford Nanopore Technologies (ONT) long-read sequencing. Several bioinformatics algorithms were developed to correctly identify microsatellite alleles from sequencing data, including four modes for generating standard consensus alleles and two modes for generating combined consensus alleles. We provide a detailed analysis and comparison of these approaches and make several recommendations for the accurate determination of microsatellite allele length.


Conflict of interest statement

None declared.

Figures

Graphical Abstract
Figure 1.
Experimental workflow of NGS experiments performed in our study. (A) Details of the steps performed for short-read and long-read NGS library preparation using microsatellite-containing plasmids as DNA samples. Underlined PCR conditions correspond to those applied to randomly sheared DNA. (B) Schematic representation of the molecular components of the five types of NGS libraries used in our study. Illumina short-read sequencing libraries are shown single-stranded. In the double-stranded configuration, PCR-free type I and II libraries carry “Y” structures at each DNA end because of the stubby Y-adapters, whereas PCR-containing libraries are fully complementary. For “duplex-sequencing” type III libraries, the two strands of the same dsDNA molecule were barcoded at each DNA extremity, allowing their identification (each strand presents either the αβ or the βα barcode combination in reads 1 and 2) and error correction after PCR and sequencing.
Figure 2.
Workflow of the different bioinformatic analyses performed using Illumina short-read sequencing data. (A) Standard microsatellite allele length analysis workflow used with type I, II, and III libraries. (B) Single-UMI error correction workflow used with type II libraries. (C) Duplex sequencing standard and combined error correction workflow used with type III libraries. “(Selected)” indicates the analyses, features, or options whose results are presented in the manuscript.
Figure 3.
Workflow of the different bioinformatic analyses performed using (A) PacBio (type IV libraries) and (B) ONT (type V libraries) long-read sequencing data.
Figure 4.
Effect of PCR cycles included in library preparation on microsatellite allele length obtained after Illumina short-read sequencing using SSDP. (A) n − 1, n, and n + 1 microsatellite allele frequencies obtained with PCR-free and PCR-containing (1–20 cycles) libraries and Illumina short-read sequencing using SSDP as templates. (B) Change in the percentage of original alleles of the nine studied microsatellites with the number of PCR cycles. The figure presents only type I and II library data, and each point includes at least five replicates. Data were generated on Illumina iSeq 100 and NextSeq 500 instruments.
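The n − 1, n, and n + 1 allele frequencies in panel A can be illustrated with a minimal sketch. It assumes that each read has already been reduced to a measured repeat count and that frequencies are taken over all reads covering the microsatellite; the function name `allele_frequencies` and the input format are hypothetical, not taken from the study's pipeline.

```python
from collections import Counter


def allele_frequencies(lengths, expected):
    """Fraction of reads at n-1, n, and n+1 repeat units.

    lengths  -- measured repeat counts, one per read (assumed input format)
    expected -- the original allele length n of the plasmid insert
    """
    if not lengths:
        raise ValueError("no reads to count")
    counts = Counter(lengths)
    total = len(lengths)  # assumption: denominator is all reads, not just n-1..n+1
    return {
        "n-1": counts[expected - 1] / total,
        "n": counts[expected] / total,
        "n+1": counts[expected + 1] / total,
    }


# Ten hypothetical reads of a 25-unit repeat.
freqs = allele_frequencies([24, 25, 25, 25, 26, 25, 23, 25, 25, 25], expected=25)
print(freqs)
```

Reads that fall outside the n − 1 to n + 1 window (the single 23-unit read here) still count toward the denominator in this sketch, so the three frequencies need not sum to 1.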
Figure 5.
Impact of UMI error correction on Illumina short-read sequencing data for accurate length determination of microsatellites. (A) n − 1, n, and n + 1 microsatellite allele frequencies obtained before (-) and after (+) UMI error correction (Max. Freq. mode) from type II libraries including 12, 16, and 20 PCR cycles and using SSDP and RFP as templates. (B) Reduction of the error rate in the length of microsatellite alleles after UMI error correction, expressed as fold-change. Type I PCR-free data from Illumina short-read sequencing are also shown in panel A for comparison. Each point includes at least triplicate experimental data, except for the PCR-free – SSDP and (AC)25 – 12 cycles – SSDP – UMI error correction conditions (duplicates). All data were generated on an Illumina NextSeq 500.
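A rough sketch of how a maximum-frequency UMI consensus and the fold-change in panel B could be computed is given here. The caption names the mode but not its details, so the minimum family size, the handling of ties, and all function names are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def max_freq_consensus(reads, min_family_size=2):
    """Collapse (umi, repeat_length) pairs to one length per UMI family.

    The most frequent length in a family becomes its consensus.
    Assumptions: families smaller than min_family_size and families with
    a tie for the most frequent length are discarded.
    """
    families = defaultdict(list)
    for umi, length in reads:
        families[umi].append(length)

    consensus = {}
    for umi, lengths in families.items():
        if len(lengths) < min_family_size:
            continue
        ranked = Counter(lengths).most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            continue  # ambiguous family, no clear majority
        consensus[umi] = ranked[0][0]
    return consensus


def error_rate(lengths, expected):
    """Fraction of alleles whose length differs from the original n."""
    lengths = list(lengths)
    return sum(1 for x in lengths if x != expected) / len(lengths)


def fold_reduction(rate_before, rate_after):
    """Error-rate reduction as a fold-change (infinite if no errors remain)."""
    return float("inf") if rate_after == 0 else rate_before / rate_after


# Hypothetical reads: (UMI, measured repeat count) for a 25-unit repeat.
reads = [("AAC", 25), ("AAC", 25), ("AAC", 24),
         ("GGT", 25), ("GGT", 26), ("GGT", 25),
         ("TTA", 24), ("TTA", 24), ("CCG", 25)]
corrected = max_freq_consensus(reads)
before = error_rate([length for _, length in reads], 25)
after = error_rate(corrected.values(), 25)
print(corrected, fold_reduction(before, after))
```

In this toy data the family `TTA` keeps a wrong length after correction because every read in it carries the same error, which is the kind of early PCR error that single-UMI correction cannot remove.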
Figure 6.
Impact of duplex sequencing dual-UMI error correction on Illumina short-read sequencing data for accurate length determination of microsatellites. (A) n − 1, n, and n + 1 microsatellite allele frequencies obtained before (-) and after (+) different types of UMI-based error correction from type III libraries including 16 PCR cycles and using SSDP and RFP as templates. Error corrections included standard UMI error correction (Max. Freq. mode), obtained from αβ reads, βα reads, or both read types (αβ+βα), and combined error correction, obtained from αβ and βα consensus sequences and based either on the maximum length (the largest allele is kept as consensus) or on equal length (a consensus is reached only when the αβ and βα consensus alleles are of the same length). (B) Reduction of the error rate in the length of microsatellite alleles after error correction, expressed as fold-change. Type I PCR-free data from Illumina short-read sequencing are shown in panel A for comparison. Each point originated from duplicate (SSDP) or triplicate (RFP) experimental data; however, for combined error correction, some data points were partially or totally lost (no bars in the original n alleles). All data were generated on an Illumina NextSeq 500.
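The two combined error-correction rules described in this caption can be sketched as follows. The sketch assumes that per-strand consensus lengths have already been computed and are keyed by a shared molecule identifier; the function name `combined_consensus`, the mode labels, and the dictionary layout are hypothetical.

```python
def combined_consensus(ab, ba, mode="equal"):
    """Combine per-strand consensus lengths of the same dsDNA molecule.

    ab, ba -- dicts mapping a molecule identifier to the consensus repeat
              length from the αβ and βα read families (assumed layout)
    mode   -- "max":   keep the larger of the two strand alleles
              "equal": keep the allele only if both strands agree
    Molecules seen on only one strand are dropped in both modes.
    """
    if mode not in ("max", "equal"):
        raise ValueError(f"unknown mode: {mode}")

    combined = {}
    for molecule in ab.keys() & ba.keys():
        a, b = ab[molecule], ba[molecule]
        if mode == "max":
            combined[molecule] = max(a, b)
        elif a == b:
            combined[molecule] = a
    return combined


# Hypothetical strand consensus lengths for a 25-unit repeat.
ab = {"m1": 25, "m2": 24, "m3": 25}
ba = {"m1": 25, "m2": 25, "m4": 25}
print(combined_consensus(ab, ba, mode="max"))
print(combined_consensus(ab, ba, mode="equal"))
```

The equal-length rule discards every molecule whose strands disagree or whose partner strand is missing, which fits the caption's remark that some data points were partially or totally lost under combined error correction.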
Figure 7.
Accurate length determination of microsatellites from PacBio long-read sequencing (type IV libraries) using different strategies. (A) ZMW count per minimum number of subreads. (B) CCS read count per minimum Quality Score. (C) n − 1, n, and n + 1 microsatellite allele frequencies obtained from PacBio long-read sequencing using different approaches for accurate length determination of microsatellites. The Quality Score approach is based on different thresholds of CCS high-fidelity read quality. Maximum frequency, mean length, and median length are in-house approaches similar to those developed for UMI error correction (see Supplementary Fig. S6A), based on a defined number of subreads from ZMWs with a Barcode Score of 70 to generate a microsatellite consensus sequence. Conditions with fewer than 100 consensus sequences are not represented on the graph (no bars in the original n alleles). Each point corresponds to a single experiment. Type I PCR-free – SSDP data (two replicates) from Illumina short-read sequencing (NextSeq 500) were also presented in panel B for comparison.
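The maximum frequency, mean length, and median length modes named in this caption could look roughly like the sketch below, applied to the subreads of one ZMW. The minimum subread count, the rounding of mean and median to whole repeat units, and the function name are assumptions; the in-house implementation is not described here.

```python
from collections import Counter
from statistics import mean, median


def subread_consensus(subread_lengths, mode="max_freq", min_subreads=3):
    """Consensus repeat length for one ZMW from its subread lengths.

    Returns None when the ZMW has fewer than min_subreads subreads
    (the threshold value is an assumption).
    """
    if len(subread_lengths) < min_subreads:
        return None
    if mode == "max_freq":
        # Ties resolve to the length seen first (Counter keeps insertion order).
        return Counter(subread_lengths).most_common(1)[0][0]
    if mode == "mean":
        # Assumption: round to the nearest whole repeat unit.
        return round(mean(subread_lengths))
    if mode == "median":
        return round(median(subread_lengths))
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical subread lengths from a single ZMW.
lengths = [24, 24, 24, 25, 26, 26, 27]
for mode in ("max_freq", "mean", "median"):
    print(mode, subread_consensus(lengths, mode=mode))
```

With these seven subreads the maximum-frequency mode returns 24, while the mean and median modes both return 25, showing that the choice of mode can change the called allele for the same ZMW.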
Figure 8.
Accurate length determination of microsatellites from ONT long-read sequencing (type V libraries) using different strategies. (A) Simplex and duplex read counts per minimum Quality Score. (B) Duplex read rate (%) per library. (C) n − 1, n, and n + 1 microsatellite allele frequencies obtained from ONT long-read sequencing using different approaches for accurate length determination of microsatellites. Simplex read analysis considered the forward reads, the reverse reads, or all reads, using two minimum Quality Score thresholds (≥Q15 and ≥Q25). Duplex read analysis was based on two minimum Quality Score thresholds (≥Q25 and ≥Q33). Consensus analysis was based on paired simplex reads identified from duplex reads (≥Q25 and ≥Q33), using either the maximum length or the length only when both reads were equal. Each point represents a triplicate experiment. PCR-free – SSDP data (two replicates) from Illumina short-read sequencing (NextSeq 500) are also shown for comparison.
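The Q15, Q25, and Q33 thresholds can be read as error probabilities through the standard Phred relation Q = −10 · log10(p). The short sketch below converts the thresholds and applies a minimum-quality filter; the read tuples and function names are hypothetical, and how a per-read Quality Score is derived from base qualities is left to the basecaller.

```python
def phred_to_error_prob(q):
    """Error probability implied by a Phred-scaled quality score."""
    return 10 ** (-q / 10)


def filter_by_min_q(reads, min_q):
    """Keep the identifiers of reads whose quality score is >= min_q.

    reads -- iterable of (read_id, read_quality_score) pairs (assumed format)
    """
    return [read_id for read_id, q in reads if q >= min_q]


# Thresholds named in the caption and the error probability each implies.
for q in (15, 25, 33):
    print(f"Q{q}: p = {phred_to_error_prob(q):.5f}")

# Hypothetical reads with per-read quality scores.
reads = [("r1", 14.9), ("r2", 25.0), ("r3", 33.5)]
print(filter_by_min_q(reads, min_q=25))
```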

