Sci Rep. 2022 Jun 7;12(1):9352.
doi: 10.1038/s41598-022-13024-4.

Linked-read sequencing for detecting short tandem repeat expansions

Readman Chiu et al. Sci Rep. 2022.

Abstract

Detection of short tandem repeat (STR) expansions with standard short-read sequencing is challenging due to the difficulty in mapping multicopy repeat sequences. In this study, we explored how the long-range sequence information of barcode linked-read sequencing (BLRS) can be leveraged to improve repeat-read detection. We also devised a novel algorithm using BLRS barcodes for distance estimation and evaluated its application for STR genotyping. Both approaches were designed for genotyping large expansions (>1 kb) that cannot be sized accurately by existing methods. Using simulated and experimental data of genomes with STR expansions from multiple BLRS platforms, we validated the utility of barcode and phasing information in attaining better STR genotypes compared to standard short-read sequencing. Although the coverage bias of extremely GC-rich STRs is an important limitation of BLRS, BLRS is an effective strategy for genotyping many other STR loci.

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
IRR extraction using barcodes in BLRS. (a) Steps in using barcodes for IRR extraction and STR size estimation. (b–d) IRR counts (left) and repeat count estimates (right) of the target loci within the three groups of datasets: (b) heterozygous ATTCT expansion in ATXN10 in 10x and stLFR simulations; (c) homozygous larger-than-reference CCAT polymorphism in NA12878 from the 10x, stLFR, and TELL-Seq BLRS platforms; and (d) FXN GAA expansions in 10x data of four Coriell cell lines. The methods compared were EH without OTS (EH_noOTS, blue), EH with OTS (EH_OTS, olive), and barcode-based IRR extraction (barcode, red). For the FXN samples, EH results (with or without OTS) from standard Illumina data (EH_OTS(S), light blue and EH_noOTS(S), light olive) were also included. Only results for the expanded alleles in the samples are shown (hence two separate tallies for the homozygous FXN GM15850 sample). “Expected” or “ground truth” IRR counts (for the simulations) and repeat counts are plotted as orange horizontal bars together with the exact numbers. Exact IRR or repeat counts are shown on top of each bar. Confidence intervals of the estimates reported by EH are shown as error bars. For the custom barcode-based method, the error bars reported for 10x data correspond to the range of estimates calculated independently using each of the two read lengths (see the Methods section). “NA” indicates that results were unavailable for certain samples because of a segmentation fault in EH runs.
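The idea behind panel (a) can be illustrated with a minimal sketch. It assumes that IRR stands for in-repeat reads (reads consisting almost entirely of the repeat motif) and that a read is a candidate IRR for a locus when it shares a barcode with reads aligned to that locus's flanking sequence. The `Read` type, the `in_flank` flag, and the 0.9 match threshold are hypothetical names and values chosen for illustration; they are not taken from the paper's implementation.

```python
from dataclasses import dataclass

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


@dataclass(frozen=True)
class Read:
    barcode: str    # linked-read barcode attached to the read
    sequence: str   # read bases (upper-case A/C/G/T)
    in_flank: bool  # hypothetical flag: read aligned next to the STR locus


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def is_in_repeat(sequence: str, motif: str, min_fraction: float = 0.9) -> bool:
    """True if the read matches a tiling of the motif (any rotation, either
    strand) at no fewer than `min_fraction` of its positions."""
    if not sequence or not motif:
        return False
    n = len(motif)
    best = 0
    for unit in (motif, reverse_complement(motif)):
        for shift in range(n):
            rotated = unit[shift:] + unit[:shift]
            matches = sum(
                1 for i, base in enumerate(sequence) if base == rotated[i % n]
            )
            best = max(best, matches)
    return best / len(sequence) >= min_fraction


def extract_irrs(reads: list[Read], motif: str) -> list[Read]:
    """Keep repeat-only reads whose barcode also occurs on a flanking read."""
    flank_barcodes = {r.barcode for r in reads if r.in_flank}
    return [
        r
        for r in reads
        if not r.in_flank
        and r.barcode in flank_barcodes
        and is_in_repeat(r.sequence, motif)
    ]


reads = [
    Read("BC1", "ACGGTCAGTTACGATCGGAT", in_flank=True),   # anchors BC1 to the locus
    Read("BC1", "TCTAT" * 4, in_flank=False),             # rotation of ATTCT, kept
    Read("BC1", "AGAAT" * 4, in_flank=False),             # reverse strand, kept
    Read("BC2", "ATTCT" * 4, in_flank=False),             # barcode not at locus
    Read("BC1", "ACGGTCAGTTACGATCGGAT", in_flank=False),  # not a repeat read
]
irrs = extract_irrs(reads, "ATTCT")
print(len(irrs))
```

In this toy input, the barcode filter is what separates repeat reads belonging to the target locus from reads of the same motif that originate elsewhere in the genome.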
Figure 2
Size estimation of genomic intervals and STR loci using Jaccard index of barcode sharing in BLRS. (a) Inverse relationship between Jaccard index and genomic interval size observed in NA12878 of each of the three BLRS platforms. The colored bands correspond to the 95% confidence intervals for each platform. (b) Schematic of a hypothetical example illustrating the concept and terminology in computing the Jaccard index of barcode sharing for a given genomic interval. (c–f) Scatter plots of estimates (y-axis) vs. truths (x-axis) for ~700 arbitrary genomic intervals (black) and the target STR (red) in the simulation (c), NA12878 (d), FXN (e), and FMR1 (f) datasets. Only estimates of the expanded allele at the target loci are shown. Confidence intervals of the estimates of the target loci are shown as red error bars. Dotted red diagonal lines were added to help visualize the deviation of the estimates from the true values. “Truths” (x-axis) for the ~700 genomic intervals in all plots were calculated based on hg38 genomic coordinates. “Truth” for the target locus is the size of the ATXN10 repeat with which we replaced the reference in the modified genome to generate the simulated datasets (c); the size of the CCAT allele we determined from the NA12878 assembly (d); and the sizes of the FXN (e) and FMR1 (f) repeats in the Coriell samples according to online information for the respective cell lines (Table S1).
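As a rough illustration of the concept in panels (a) and (b), the sketch below computes a Jaccard index from the barcode sets observed on either side of an interval and converts it to a size using intervals of known size as calibration. The function names, the use of two flanking barcode sets, and the piecewise-linear interpolation are assumptions made for this example; the caption does not specify the model the authors fitted.

```python
def jaccard_index(left_barcodes, right_barcodes) -> float:
    """Shared barcodes divided by all barcodes seen on either side."""
    left, right = set(left_barcodes), set(right_barcodes)
    union = left | right
    return len(left & right) / len(union) if union else 0.0


def estimate_size(jaccard: float, calibration) -> float:
    """Interpolate an interval size from (jaccard, known_size) pairs.

    Assumes sizes decrease as the Jaccard index increases, matching the
    inverse relationship described for panel (a). Values outside the
    calibrated range are clamped to the nearest calibration point.
    """
    points = sorted(calibration)  # ascending Jaccard index
    if not points:
        raise ValueError("calibration must contain at least one point")
    if jaccard <= points[0][0]:
        return float(points[0][1])
    if jaccard >= points[-1][0]:
        return float(points[-1][1])
    for (j0, s0), (j1, s1) in zip(points, points[1:]):
        if j0 <= jaccard <= j1:
            if j1 == j0:
                return (s0 + s1) / 2
            t = (jaccard - j0) / (j1 - j0)
            return s0 + t * (s1 - s0)
    raise AssertionError("unreachable: jaccard lies within the calibrated range")


# Hypothetical calibration points: (Jaccard index, interval size in bp).
calibration = [(0.8, 1000), (0.6, 5000), (0.4, 10000)]

j = jaccard_index({"BC1", "BC2", "BC3"}, {"BC2", "BC3", "BC4"})
size = estimate_size(j, calibration)
print(j, round(size))
```

The intuition is that barcodes mark reads from the same long DNA molecule, so two regions that are close together share more barcodes than two regions that are far apart. An expanded repeat pushes its flanks apart and lowers the index relative to what the reference coordinates would predict.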
