Genome Res. 2015 May;25(5):750-61.
doi: 10.1101/gr.182212.114. Epub 2015 Feb 6.

MIPSTR: a method for multiplex genotyping of germline and somatic STR variation across many individuals

Keisha D Carlson et al. Genome Res. 2015 May.

Erratum in

Abstract

Short tandem repeats (STRs) are highly mutable genetic elements that often reside in regulatory and coding DNA. The cumulative evidence of genetic studies on individual STRs suggests that STR variation profoundly affects phenotype and contributes to trait heritability. Despite recent advances in sequencing technology, STR variation has remained largely inaccessible across many individuals compared to single nucleotide variation or copy number variation. STR genotyping with short-read sequence data is confounded by (1) the difficulty of uniquely mapping short, low-complexity reads; and (2) the high rate of STR amplification stutter. Here, we present MIPSTR, a robust, scalable, and affordable method that addresses these challenges. MIPSTR uses targeted capture of STR loci by single-molecule Molecular Inversion Probes (smMIPs) and a unique mapping strategy. Targeted capture and our mapping strategy resolve the first challenge; the use of single molecule information resolves the second challenge. Unlike previous methods, MIPSTR is capable of distinguishing technical error due to amplification stutter from somatic STR mutations. In proof-of-principle experiments, we use MIPSTR to determine germline STR genotypes for 102 STR loci with high accuracy across diverse populations of the plant A. thaliana. We show that putatively functional STRs may be identified by deviation from predicted STR variation and by association with quantitative phenotypes. Using DNA mixing experiments and a mutant deficient in DNA repair, we demonstrate that MIPSTR can detect low-frequency somatic STR variants. MIPSTR is applicable to any organism with a high-quality reference genome and is scalable to genotyping many thousands of STR loci in thousands of individuals.

Figures

Figure 1.
MIPSTR determines germline and somatic STR variation through targeted capture, sequencing, and a novel mapping strategy. (A) Single-molecule molecular inversion probe (smMIP) with common backbone for PCR primer binding (dark-green; also shown, PCR and sequencing primers with arrows and purple sequencing adapters); 12 base pair degenerate tag (striped, green/white); and targeting arms with locus-specific, STR-flanking sequence (blue). One targeting arm is the primer for polymerase extension (extension arm). Ligation closes the circle at the other targeting arm (ligation arm). (B) Capture across genetically diverse individuals identifies germline STR variation. (C) MIPSTR distinguishes somatic STR variation from technical error, using many degenerate tags. STR variation within a tag-defined read group (i.e., reads with the same degenerate tag) is considered technical error. STR variation across tag-defined read groups is considered somatic variation. (D) MIPSTR maps reads from a given STR locus (based on targeting arm sequence) to its locus-specific synthetic reference with unit numbers 1–100 (1–7 shown here). The STR 1 read aligns perfectly to locus-specific synthetic reference Unit #6 (green check mark); all other alignments show gaps (dashed line, red X). SNVs (in pink), even if occurring in the STR sequence, do not affect mapping or STR unit number genotype calls.
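The mapping idea in panel D can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the flank and unit sequences, and the `max_mismatches` threshold are hypothetical, and a gapless Hamming comparison stands in for whatever aligner the method actually uses. The sketch builds one synthetic reference per unit number (1–100) and accepts the reference that matches a read without gaps, tolerating a few substitutions so that SNVs do not change the unit number call.

```python
from typing import Dict, Optional


def build_synthetic_references(left_flank: str, unit: str, right_flank: str,
                               max_units: int = 100) -> Dict[int, str]:
    """Build one locus-specific reference per unit number (1..max_units)."""
    return {n: left_flank + unit * n + right_flank
            for n in range(1, max_units + 1)}


def call_unit_number(read: str, references: Dict[int, str],
                     max_mismatches: int = 2) -> Optional[int]:
    """Return the unit number whose reference matches the read without gaps.

    A gapless match requires equal length; substitutions (e.g., SNVs) are
    tolerated up to max_mismatches, a hypothetical threshold.
    """
    for n, ref in references.items():
        if len(ref) != len(read):
            continue  # a length difference would force a gapped alignment
        mismatches = sum(a != b for a, b in zip(read, ref))
        if mismatches <= max_mismatches:
            return n
    return None


# Hypothetical locus: 4-bp flanks around a GA dinucleotide repeat.
refs = build_synthetic_references("ACGT", "GA", "TTCA")
clean_read = "ACGT" + "GA" * 6 + "TTCA"
snv_read = "ACGT" + "GAGAGTGAGAGA" + "TTCA"  # one substitution inside the STR
print(call_unit_number(clean_read, refs), call_unit_number(snv_read, refs))
```

In this toy setup the read length alone fixes the unit number once both flanks are present; the mismatch check only guards against reads that do not belong to the locus.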
Figure 2.
MIPSTR accurately determined germline STR unit number in the reference strain Col-0. Raw read counts at 30 representative STR loci, with reference genome STR unit number indicated in green. UNK indicates gene of unknown function. Numbers shown in parentheses refer to STR IDs (see Supplemental Table 1). Two instances of genomic duplication (residing in transposons) are shown (STR ID 73 and 89); both alleles showed comparable read counts. Note that erroneous calls show low read counts or high technical error. Bold red outlines indicate examples discussed further in the text.
Figure 3.
MIPSTR distinguished technical error from somatic variation. (A) Three histograms from Figure 2 with total read counts. (Left) The known ELF3 STR unit number is clearly supported by the modal unit number. (Middle) This intergenic STR showed great variation in STR unit number; the mode did not support the known STR unit number. (Right) This STR resides in two copies in two different genomic locations (transposons). Both known alleles were identified, yet total read counts alone cannot distinguish genomic duplicates from technical or somatic error. (B) Reads are separated into tag-defined read groups with dot sizes and shading representing read count (different scales for each locus, see inset). Colored boxes are shown in detail in C. (Left) All tag-defined read groups with one exception supported the known STR unit number seven. Most tag-defined read groups showed low levels of technical error, primarily reads with unit number six (-1), but also five and eight. (Middle) Separating reads into tag-defined read groups illustrates the extremely high technical error for this STR. The mode of a tag-defined read group was often supported by <50% of total reads. Some tag-defined read groups contained as many as six different STR genotypes. We exclude such loci from the analysis of somatic STR variation. (Right) As expected for a duplicate STR or a heterozygote, approximately half of the tag-defined read groups support each of the known STR genotypes with very little technical error. We also observed evidence of a somatic STR allele with unit number six, which was supported by two tag-defined read groups (boxed, black outline). Note the absence of either of the known STR alleles for these tag-defined read groups. This STR genotype is also visible in the total read count histogram (A, right), where it would be interpreted as a technical error by other methods. (C) Detailed views of plots in B; outline color corresponds to respective plot.
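The tag-based separation described in this caption can be sketched in Python as follows. This is an illustrative reading of the caption, not the published implementation: the input layout (pairs of degenerate tag and called unit number), the function names, and the `min_support` cutoff are assumptions. Reads sharing a tag are collapsed to their modal unit number, within-group disagreement is treated as technical error, and group modes that differ from the known germline alleles are reported as candidate somatic alleles.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Set, Tuple


def summarize_read_groups(reads: Iterable[Tuple[str, int]]) -> Dict[str, dict]:
    """Collapse reads with the same degenerate tag to a modal unit number."""
    groups = defaultdict(Counter)
    for tag, unit_number in reads:
        groups[tag][unit_number] += 1
    summary = {}
    for tag, counts in groups.items():
        mode, mode_count = counts.most_common(1)[0]
        total = sum(counts.values())
        summary[tag] = {
            "mode": mode,
            "support": mode_count / total,  # fraction of reads agreeing with the mode
            "n_reads": total,
        }
    return summary


def candidate_somatic_alleles(summary: Dict[str, dict],
                              germline_alleles: Set[int],
                              min_support: float = 0.5) -> Dict[int, int]:
    """Count read groups whose well-supported mode is not a germline allele.

    min_support is a hypothetical cutoff for discarding noisy read groups.
    """
    modes = Counter(s["mode"] for s in summary.values()
                    if s["support"] > min_support)
    return {unit: n for unit, n in modes.items()
            if unit not in germline_alleles}


# Toy data: tags AAA and CCC support unit number 7; tag GGG supports 6.
reads = ([("AAA", 7)] * 8 + [("AAA", 6)] * 2 + [("CCC", 7)] * 5
         + [("GGG", 6)] * 4 + [("GGG", 5)] * 1)
summary = summarize_read_groups(reads)
print(candidate_somatic_alleles(summary, germline_alleles={7}))
```

The two unit-number-6 reads under tag AAA are absorbed as technical error because the group mode is 7, whereas the GGG group, whose mode is 6, is flagged as a candidate somatic allele.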
Figure 4.
MIPSTR accurately determined germline ELF3 STR unit number across genetically diverse A. thaliana strains. Histograms of raw read counts across 30 accessions. STR unit number as determined by Sanger sequencing is indicated in green. Using tag-defined read groups, the Kin-0 ELF3 STR genotype can be resolved to the known STR genotype even with comparatively few total reads. MIPSTR clearly calls STR unit number 19 for Pro-0. Note that different individuals of the same strain were analyzed with MIPSTR and Sanger sequencing, which may explain the discrepancy. Bold red outline indicates example discussed further in the text.
Figure 5.
Observed and predicted STR variation showed greater correlation for noncoding STRs than coding STRs. The observed log10 of the standard deviation of STR unit number across strains (y-axis) is plotted against the VARscore (x-axis), which predicts STR variation from sequence characteristics. Black points are noncoding STRs, red points are coding STRs. Outliers may indicate functional importance (ELF3 STR is indicated).
Figure 6.
MIPSTR detects low-frequency STR alleles. (x-axis) Tested mixtures of Ler and Col-0 DNA; (y-axis) probability of detecting Col-0 STR alleles; (closed circles) observed frequency of detecting Col-0 STR alleles (standard error is indicated, black lines); (open circles) predicted frequency of detecting Col-0 STR alleles. To calculate the observed frequency for each mixture, we resampled tag-defined read group modes supporting either the Col-0 or Ler allele at each STR locus 1000 times. The proportion of samples that carry the Col-0 allele was determined and averaged across all STR loci that differ between Ler and Col-0. To calculate the predicted frequency for each mixture, we assumed the known ratios of Col-0 and Ler STR alleles in each mixture and the probability of observing the Col-0 STR allele with 10 observations.
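The two quantities in this caption can be sketched in Python under stated assumptions. The caption does not spell out the formula, so the sketch assumes that "the probability of observing the Col-0 STR allele with 10 observations" means seeing at least one Col-0 allele among 10 independent draws, and that each resample draws 10 read-group modes with replacement. The function names, the seed, and the toy mixture are hypothetical.

```python
import random
from typing import Sequence


def expected_detection_probability(col0_fraction: float,
                                   n_observations: int = 10) -> float:
    """P(at least one Col-0 allele in n independent observations)."""
    return 1.0 - (1.0 - col0_fraction) ** n_observations


def resampled_detection_frequency(group_modes: Sequence[str],
                                  col0_allele: str = "Col-0",
                                  n_observations: int = 10,
                                  n_resamples: int = 1000,
                                  seed: int = 0) -> float:
    """Fraction of resamples (with replacement) that contain the Col-0 allele."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        draw = rng.choices(group_modes, k=n_observations)
        if col0_allele in draw:
            hits += 1
    return hits / n_resamples


# Toy locus: 1 of 10 tag-defined read groups supports the Col-0 allele.
modes = ["Col-0"] + ["Ler"] * 9
print(expected_detection_probability(0.1))
print(resampled_detection_frequency(modes))
```

Under these assumptions the resampled frequency should approach the expected probability as the number of resamples grows; averaging it across loci that differ between Ler and Col-0 would mirror the per-mixture summary described in the caption.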
