Nat Commun. 2017 Feb 7:8:14291.
doi: 10.1038/ncomms14291.

CRISPR-Cas9-targeted fragmentation and selective sequencing enable massively parallel microsatellite analysis

GiWon Shin et al. Nat Commun. 2017.

Abstract

Microsatellites are multi-allelic and composed of short tandem repeats (STRs) with individual motifs composed of mononucleotides, dinucleotides or higher including hexamers. Next-generation sequencing approaches and other STR assays rely on a limited number of PCR amplicons, typically in the tens. Here, we demonstrate STR-Seq, a next-generation sequencing technology that analyses over 2,000 STRs in parallel, and provides the accurate genotyping of microsatellites. STR-Seq employs in vitro CRISPR-Cas9-targeted fragmentation to produce specific DNA molecules covering the complete microsatellite sequence. Amplification-free library preparation provides single molecule sequences without unique molecular barcodes. STR-selective primers enable massively parallel, targeted sequencing of large STR sets. Overall, STR-Seq has higher throughput, improved accuracy and provides a greater number of informative haplotypes compared with other microsatellite analysis approaches. With these new features, STR-Seq can identify a 0.1% minor genome fraction in a DNA mixture composed of different, unrelated samples.

Conflict of interest statement

The authors declare no competing financial interests.

Figures

Figure 1. Overview of STR-Seq.
(a) Guide RNAs and primer probes were designed to target STRs and proximal SNPs. We target both plus and minus strands; only plus-strand targeting is illustrated. In the first step, the Cas9 enzyme cleaves upstream of the STR. The DNA libraries, including the STR and SNP, then undergo targeted sequencing. (b) After initial alignment of Read 2 from any given paired-end set, we use the primer probe sequence derived from Read 2 as an index tag to link the Read 1 microsatellite internal motif and flanking sequences. If the primer probe sequence aligns within 2 bp of the expected primer probe start position, the paired Read 1 is assigned to its specific STR index tag. Based on the human genome reference, we identified the flanking genomic sequences that mark the complete STR segment and then determined the composition (that is, mononucleotide, dinucleotide and so on) and overall length of the repeat motif structure. Read 1 sequences that contain both the 5′ and 3′ flanking sequences with the internal microsatellite are used for genotyping. STR genotypes are called from Read 1. SNPs are phased with the STR genotype to generate haplotypes. (c) As an example of STR-Seq haplotyping, paired-end alignments to the reference genome are shown for an STR target (trf747130) in sample NA12878. After the STR genotyping process, 114 and 133 read pairs were found to have 11 and 8 repeats of a tetranucleotide motif (ATGA) in their Read 1s, respectively. Within each read pair group, all the base calls at the SNP position were identical, being either C (reference) or G (alternative). The CRISPR–Cas9 target site is indicated with a red arrow, and the two haplotypes are illustrated at the bottom.
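The read assignment and genotyping steps described in panel (b) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function names, the flanking sequences and the probe coordinates are hypothetical, and the sketch assumes a pure repeat (no interruptions) with exact flank matches.

```python
def assign_to_str(read2_start, expected_starts, tolerance=2):
    """Return the STR whose expected primer probe start lies within
    `tolerance` bp of the aligned Read 2 start, or None if there is none."""
    for str_id, expected in expected_starts.items():
        if abs(read2_start - expected) <= tolerance:
            return str_id
    return None


def count_repeats(read1, flank5, flank3, motif):
    """Count motif copies between the 5' and 3' flanks of Read 1.

    Returns None when the read does not contain both flanks or when the
    internal sequence is not a whole number of exact motif copies
    (a simplifying assumption of this sketch)."""
    start = read1.find(flank5)
    if start == -1:
        return None
    start += len(flank5)
    end = read1.find(flank3, start)
    if end == -1:
        return None
    internal = read1[start:end]
    copies, remainder = divmod(len(internal), len(motif))
    if remainder != 0 or internal != motif * copies:
        return None
    return copies


# Hypothetical flanks and probe position; motif and repeat counts follow panel (c).
FLANK5, FLANK3, MOTIF = "GGCTC", "CCTGT", "ATGA"
read_a = FLANK5 + MOTIF * 11 + FLANK3
read_b = FLANK5 + MOTIF * 8 + FLANK3
truncated = FLANK5 + MOTIF * 8  # lacks the 3' flank, so it is not genotyped

print(assign_to_str(1002, {"trf747130": 1000}))
print(count_repeats(read_a, FLANK5, FLANK3, MOTIF))
print(count_repeats(truncated, FLANK5, FLANK3, MOTIF))
```

Requiring both flanks is what restricts genotyping to reads that span the complete microsatellite; reads that end inside the repeat are discarded rather than miscounted.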
Figure 2. Performance of STR-Seq.
(a) The STR alleles determined by STR-Seq and CE are compared using a ‘dosage' value that is derived from the number of base pairs remaining after subtracting the reference allele. The R-squared value is shown at the top left of the plot, and the dotted diagonal line indicates 1:1 concordance. (b) BAT26 is an example where the true STR allele was obscured by artificial indels. The bar graphs show read counts for all observed alleles for both PCR-amplified (blue) and PCR-free (red) STR-Seq analyses. PCR-free STR-Seq analysis reduced the fraction of stutter artifact from 64 to 30%. The STR allelotype is indicated by the number of motif repeats, and the true allelotype is indicated with the black arrow at the top of the corresponding bar. (c) The distributions of stutter artifact fractions are shown for NA12878's 686 STRs. For each STR, the number of non-allelic reads is divided by the total number of STR-spanning reads to obtain the fraction of artificial indels. Box plots for PCR-amplified (left) versus PCR-free (right) are shown at the top right. The horizontal thickness represents the estimated and normalized kernel density. The median values are indicated as black dots inside the grey boxes, and the difference is significant (P<2.2e−16 by Wilcoxon signed-rank test).
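The stutter fraction defined in panel (c) is a simple ratio, sketched below. The function name and the read counts are invented for illustration and are not taken from the paper's data.

```python
def stutter_fraction(allele_counts, true_alleles):
    """Fraction of STR-spanning reads whose repeat count is not a called allele.

    allele_counts: dict mapping repeat count -> number of reads
    true_alleles:  set of repeat counts called as the genotype
    """
    total = sum(allele_counts.values())
    if total == 0:
        raise ValueError("no STR-spanning reads")
    non_allelic = sum(n for allele, n in allele_counts.items()
                      if allele not in true_alleles)
    return non_allelic / total


# Invented read counts for a homozygous 26-repeat locus.
counts = {24: 10, 25: 20, 26: 70}
print(stutter_fraction(counts, {26}))
```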
Figure 3. Performance of targeted CRISPR–Cas9 fragmentation.
(a) For the STR target presented here (trf676281; [ATAG]n), two gRNAs were designed with two pairs of primer probes. Read depth and pile-up of Read 1s are compared between negative control and target-specifically fragmented sample DNAs. In the pile-up plots, Read 1s from plus probes (binding downstream of the STR) align to the reference itself (forward reads; blue), while those from minus probes align to the reverse complement of the reference (reverse reads; green). For the two CRISPR–Cas9 target sites, 92 and 67% of the reads overlapping each site, respectively, shared their alignment start positions (indicated by red dotted arrows). Read depth for the STR region (shaded) was higher than that of the flanking regions when targeted fragmentation was used. Pink-coloured blocks in the read depth and pile-up plots indicate deletion events. In the reference genome, red, yellow, green and blue bars indicate A, C, G and T bases, respectively. (b) The distributions of the fraction of reads that start or stop within 2 bp of the target cut site are shown for 2,625 CRISPR–Cas9 target sites. The median values are indicated as white dots inside the black boxes, and the difference was significant (P<2.2e−16 by Wilcoxon signed-rank test). The horizontal thickness represents the estimated and normalized kernel density. (c) The estimated kernel density for the observed fraction of heterozygous alleles is shown separately for STRs with (n=56) and without (n=56) gRNA targeting. The distribution differs significantly between negative control and test runs for gRNA-targeted STRs (top; P=3.8e−06 by Levene's test), but is similar for non-gRNA-targeted STRs (bottom; P=0.96 by Levene's test).
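The per-site read fraction in panel (b) could be computed along these lines. The caption does not spell out the denominator, so this sketch assumes it is the set of reads overlapping the cut site (allowing the same 2 bp window); the function name, the interval representation and the coordinates are all hypothetical.

```python
def cut_site_fraction(reads, cut_site, window=2):
    """Fraction of reads overlapping `cut_site` that start or stop within
    `window` bp of it. `reads` is a list of (start, end) alignment
    coordinates, both inclusive."""
    overlapping = [(s, e) for s, e in reads
                   if s - window <= cut_site <= e + window]
    if not overlapping:
        return None  # no coverage at this site
    at_cut = sum(1 for s, e in overlapping
                 if abs(s - cut_site) <= window or abs(e - cut_site) <= window)
    return at_cut / len(overlapping)


# Invented alignments around a cut site at position 100.
reads = [(100, 200), (101, 220), (50, 150), (99, 180), (300, 400)]
print(cut_site_fraction(reads, cut_site=100))
```

A high fraction suggests that most fragments covering the site were produced by Cas9 cleavage rather than by random fragmentation.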
Figure 4. Sensitive detection of minor component's haplotype in mixture DNA.
(a) Observed allele fractions of informative haplotypes are plotted against the expected ratio based on the minor component fractions (25 to 0.1%) of a two-component mixture (HGDP00924 as minor and HGDP00925 as major). Most of the informative haplotypes are one of the two heterozygous alleles of the minor component, so their allelic fractions are half of the overall component fraction. For example, only one informative allele from the 10% ratio mixture (yellow dots) is expected to be 10%, while the expected fraction for every other allele is 5%. Both the x- and y-axes are shown in log scale. The R-squared value is shown at the top left of the plot, and the dotted diagonal line indicates 1:1 concordance. (b) A mixture of two individuals (0.1% HGDP00924 and 99.9% HGDP00925) was analysed for a dinucleotide repeat (trf291274). M and N alleles indicate genotypes from the major and minor components, respectively. The bar graph in the right box shows read counts for all observed alleles, separately for the two SNP alleles found by STR-Seq analysis. A haplotype (11 motif repeats and G allele) specific to the minor component was detectable. By contrast, the bar graph on the bottom left shows collective read counts regardless of linked SNP genotype. Neither allele from the minor component is detectable there because both are mixed with artificial indels from the major component.
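The expected allele fractions in panel (a) follow from simple diploid arithmetic, illustrated below. The function name and the `copies` parameter are introduced here for illustration only; the sketch assumes a two-component mixture of diploid genomes.

```python
def expected_haplotype_fraction(minor_fraction, copies=1):
    """Expected read fraction of a haplotype unique to the minor component.

    minor_fraction: share of the minor genome in the mixture (0 to 1)
    copies:         1 if the minor component is heterozygous for the
                    haplotype, 2 if it is homozygous
    """
    if not 0 <= minor_fraction <= 1:
        raise ValueError("minor_fraction must be between 0 and 1")
    if copies not in (1, 2):
        raise ValueError("copies must be 1 or 2 for a diploid genome")
    return minor_fraction * copies / 2


# 10% mixture: heterozygous haplotypes are expected at 5%, homozygous at 10%.
print(expected_haplotype_fraction(0.10, copies=1))
print(expected_haplotype_fraction(0.10, copies=2))
# 0.1% mixture: a heterozygous informative haplotype is expected at 0.05%.
print(expected_haplotype_fraction(0.001, copies=1))
```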

