Am J Hum Genet. 2017 Nov 2;101(5):700-715.
doi: 10.1016/j.ajhg.2017.09.013.

Profiling of Short-Tandem-Repeat Disease Alleles in 12,632 Human Whole Genomes

Haibao Tang et al. Am J Hum Genet. 2017.

Abstract

Short tandem repeats (STRs) are hyper-mutable sequences in the human genome. They are often used in forensics and population genetics and are also the underlying cause of many genetic diseases. There are challenges associated with accurately determining the length polymorphism of STR loci in the genome by next-generation sequencing (NGS). In particular, accurate detection of pathological STR expansion is limited by the sequence read length during whole-genome analysis. We developed TREDPARSE, a software package that incorporates various cues from read alignment and paired-end distance distribution, as well as a sequence stutter model, in a probabilistic framework to infer repeat sizes for genetic loci, and we used this software to infer repeat sizes for 30 known disease loci. Using simulated data, we show that TREDPARSE outperforms other available software. We sampled the full genome sequences of 12,632 individuals to an average read depth of approximately 30× to 40× with Illumina HiSeq X. We identified 138 individuals with risk alleles at 15 STR disease loci. We validated a representative subset of the samples (n = 19) by Sanger and by Oxford Nanopore sequencing. Additionally, we validated the STR calls against known allele sizes in a set of GeT-RM reference cell-line materials (n = 6). Several STR loci that consist entirely of guanines or cytosines (G or C) have insufficient read evidence for inference and therefore could not be assayed precisely by TREDPARSE. TREDPARSE extends the limit of STR size detection beyond the physical sequence read length. This extension is critical because many of the disease risk cutoffs are close to or beyond the short sequence read length of 100 to 150 bases.

Keywords: genetic disorder; genome sequencing; genotyping; microsatellites; population genetics; short tandem repeats; trinucleotide repeat expansion.
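
To make the read-length limit described in the abstract concrete, the following minimal sketch computes the largest repeat count that a single read can cover end to end. The function name `max_spanned_repeats` and the 10-base flank requirement on each side are illustrative assumptions; they are not parameters of TREDPARSE.

```python
def max_spanned_repeats(read_length: int, motif_length: int, min_flank: int = 10) -> int:
    """Largest number of repeat units one read can span end to end.

    A read spans an STR only if it covers the whole repeat tract plus some
    unique flanking sequence on both sides. `min_flank` is a hypothetical
    anchoring requirement chosen for illustration, not a TREDPARSE setting.
    """
    usable = read_length - 2 * min_flank  # bases left for the repeat tract
    if usable <= 0:
        return 0
    return usable // motif_length


# Trinucleotide repeats (e.g. CAG) at the two read lengths named in the abstract.
for read_length in (100, 150):
    print(read_length, max_spanned_repeats(read_length, motif_length=3))
```

Under these assumptions, a trinucleotide tract of 40 units is already 120 bases long, so it cannot be spanned by a 100-base read. Alleles in that range must be sized from other evidence, such as the partial reads, repeat-only reads, and paired-end distances named in the abstract and in Figure 2.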


Figures

Figure 1
TREDPARSE Workflow for Calling STR-Related Genetic Disease. The workflow includes ploidy inference, read realignment, and integration of various types of evidence in a probabilistic model.
Figure 2
Integrated Probabilistic Model for Calling STRs with Four Types of Evidence. (A) Model based on spanning reads. (B) Model based on partial reads. (C) Model based on repeat-only reads. (D) Model based on paired-end reads. (E) Predictive power for each of the four evidence types on the range of STR repeat lengths. Darker shades of green represent higher confidence of inference.
Figure 3
Examples of Posterior Probability Density Function Based on the Integrated Model for Calling STRs. (A) Simulated diploid with $h_1 = 20$, $h_2 = 140$; there is no uncertainty around $h_1$ and some uncertainty around $h_2$. (B) Simulated diploid with $h_1 = 70$, $h_2 = 140$, showing a slight negative dependence between $h_1$ and $h_2$.
Figure 4
Simulations with Synthetic Datasets of Implanted STR Alleles at the Huntington Locus. (A) Performance comparison of TREDPARSE and lobSTR on a simulated haploid with a single allele of h CAG units, where h varies from 1 to 300. (B) Performance comparison of TREDPARSE and lobSTR on a simulated diploid with two alleles, one fixed at 20 CAGs and the other with h CAG units. (C) Performance of TREDPARSE on a simulated diploid with a low haploid depth of 5×. (D) Performance of TREDPARSE on a simulated diploid with a high haploid depth of 80×. Shaded regions represent a 95% credible interval for TREDPARSE estimates of h. RMSD represents the root-mean-square deviation, calculated as $\mathrm{RMSD} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(h_i - \hat{h}_i)^2}$, where $N = 150$.
Figure 5
Testing and Validation of TREDPARSE on 12,632 Whole-Genome Sequences. We ran TREDPARSE on sequence data from 12,632 individuals and identified 138 individuals with risk alleles at a total of 15 disease loci. A subset of the inferred at-risk samples was validated by complementary sequencing experiments.
Figure 6
Individuals with Risk Alleles at the Huntington Disease Locus in Whole-Genome Samples. (A) A family with the putative HD risk allele transmitted between generations. (B) A second family with the putative DM1 risk allele transmitted between generations. (C) A third family with the putative SCA17 allele transmitted between generations. The expanded risk alleles are highlighted in red. For both alleles, the 95% credible intervals are provided below the estimates. Age refers to the biological age of the individual at the time when the DNA sample was taken.
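
The root-mean-square deviation used in the Figure 4 caption to score the simulations can be computed as in the sketch below. The function name `rmsd` and its argument names are illustrative assumptions and do not come from the TREDPARSE code base. Here `true_sizes` holds the implanted repeat counts and `estimated_sizes` the corresponding inferred values.

```python
import math
from typing import Sequence


def rmsd(true_sizes: Sequence[float], estimated_sizes: Sequence[float]) -> float:
    """Root-mean-square deviation between true and estimated repeat sizes.

    Implements sqrt((1/N) * sum_i (h_i - h_hat_i)^2) for N paired values.
    """
    if len(true_sizes) != len(estimated_sizes) or len(true_sizes) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_error = sum(
        (h - h_hat) ** 2 for h, h_hat in zip(true_sizes, estimated_sizes)
    )
    return math.sqrt(squared_error / len(true_sizes))


# Toy values only; these are not results from the paper.
print(rmsd([20, 40, 60], [20, 41, 58]))
```

A lower RMSD means the estimates track the implanted allele sizes more closely across the simulated range.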
