Genome Med. 2017 Jul 18;9(1):65.
doi: 10.1186/s13073-017-0456-7.

Interrogating the "unsequenceable" genomic trinucleotide repeat disorders by long-read sequencing

Qian Liu et al. Genome Med. 2017.

Abstract

Microsatellite expansion, such as trinucleotide repeat expansion (TRE), is known to cause a number of genetic diseases. Sanger sequencing and next-generation short-read sequencing are unable to interrogate TRE reliably. We developed a novel algorithm called RepeatHMM to estimate repeat counts from long-read sequencing data. Evaluation on simulation data, real amplicon sequencing data on two repeat expansion disorders, and whole-genome sequencing data generated by PacBio and Oxford Nanopore technologies showed superior performance over competing approaches. We concluded that long-read sequencing coupled with RepeatHMM can estimate repeat counts on microsatellites and can interrogate the "unsequenceable" genomic trinucleotide repeat disorders.
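To make the notion of a "repeat count" concrete, the following is a minimal sketch that counts the longest run of exact, back-to-back copies of a repeat unit in a single read. It is not RepeatHMM and does not reproduce the authors' algorithm; the function name `longest_repeat_run` and both example reads are hypothetical. It only illustrates why exact matching tends to undercount when a read contains a sequencing error inside the repeat, which is the kind of situation a model-based estimator is meant to handle.

```python
import re


def longest_repeat_run(read: str, unit: str = "CAG") -> int:
    """Return the largest number of back-to-back exact copies of `unit` in `read`.

    Naive illustration only: any mismatch or indel inside the repeat
    splits the run, so noisy reads are undercounted.
    """
    pattern = re.compile(f"(?:{re.escape(unit)})+")
    runs = [len(m.group(0)) // len(unit) for m in pattern.finditer(read.upper())]
    return max(runs, default=0)


# Hypothetical reads, invented for illustration (not data from the paper).
clean_read = "TTGA" + "CAG" * 14 + "GGCT"
# Same locus, but one repeat unit has lost a base ("CAG" -> "CG").
noisy_read = "TTGA" + "CAG" * 6 + "CG" + "CAG" * 7 + "GGCT"

print(longest_repeat_run(clean_read))  # 14
print(longest_repeat_run(noisy_read))  # 7
```

A single deleted base halves the naive count in the second read, even though the underlying allele is nearly the same length.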

Keywords: Long-read sequencing; Microsatellites; Nanopore; PacBio; RepeatHMM; Trinucleotide repeat disorders; Trinucleotide repeats.

Conflict of interest statement

Ethics approval and consent to participate

The study protocol was approved by the China-Japan Friendship Hospital and was conducted in compliance with the Declaration of Helsinki. Written informed consent was obtained from all participants before enrollment.

Consent for publication

All patient information was anonymized at source and unique ID codes were used to identify cases. Publication of de-identified results from all consenting participants was approved.

Competing interests

PZ and DW are employees of Nextomics Biosciences and KW is an advisor for Nextomics Biosciences. All other authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
A flowchart of the procedure to infer repeat counts using RepeatHMM
Fig. 2
Analysis of simulation data to infer repeat counts for ATN1. a Performance on simulated long reads with random start and end sites that cover repeats. b Performance on simulated long reads with fixed start and end sites that cover repeats. c, d The distribution of the prediction errors (estimated repeat counts minus simulated counts) on random simulation data and PCR-based simulation data, respectively. RMSE: root mean square error between simulated and estimated repeat counts for 100 participants
Fig. 3
Performance of RepeatHMM, RepeatCCS, BAMself, and TRhist on estimating the repeat counts in ATXN3 for 20 patients with SCA3 and five controls. The gold standards (x-axis) were determined by capillary electrophoresis for 20 patients or by Sanger sequencing for five controls. a Scatterplot of estimated repeat counts and true counts. b, c The difference of estimated repeat counts and true counts by RepeatHMM, RepeatCCS, BAMself, and TRhist. RepeatCCS refers to the use of RepeatHMM on error-corrected reads generated by the circular consensus sequencing protocol
Fig. 4
The distribution of repeat counts estimated by RepeatHMM for three patients with SCA10. The RepeatHMM estimates of the pathogenic allele for subjects A, B and C were 830 (a), 825 (b) and 488 (c); the corresponding estimates by gel electrophoresis were ~840, ~820 and ~530, respectively
Fig. 5
Comparison of the estimation of repeat counts on NA12878 using three sequencing platforms. The sequencing platforms include Illumina short-read sequencing, PacBio long-read sequencing (a), and Nanopore long-read sequencing (b). We examined 40 microsatellites with repeat units in the range of 2–5 bp, which are short enough to be confidently called by the Illumina data
Fig. 6
The analysis of ATXN3 in HX1 using three different sequencing techniques. a Whole-genome long-read sequencing with ~100X coverage. b PCR-based long-read sequencing with three randomly down-sampled datasets, each with ~100X coverage. c Sanger sequencing. All methods concordantly predicted that there were 14 CAG repeats in ATXN3
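The Fig. 2 caption defines the prediction error as the estimated repeat count minus the simulated count, and RMSE as the root mean square error between the two. A minimal sketch of that computation is given below; the function name `rmse` and the four count pairs are hypothetical values chosen for illustration, not results from the study.

```python
import math


def rmse(simulated, estimated):
    """Root mean square error between paired simulated and estimated counts."""
    if len(simulated) != len(estimated) or not simulated:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(e - s) ** 2 for s, e in zip(simulated, estimated)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical repeat counts, invented for illustration only.
simulated_counts = [10, 20, 30, 40]
estimated_counts = [10, 21, 29, 42]

# Prediction error as defined in the caption: estimated minus simulated.
prediction_errors = [e - s for s, e in zip(simulated_counts, estimated_counts)]

print(prediction_errors)  # [0, 1, -1, 2]
print(rmse(simulated_counts, estimated_counts))
```

Because the errors are squared before averaging, a few large misestimates raise the RMSE more than many small ones.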
