Mol Neurodegener. 2018 Aug 21;13(1):46.

doi: 10.1186/s13024-018-0274-4.

Long-read sequencing across the C9orf72 'GGGGCC' repeat expansion: implications for clinical use and genetic discovery efforts in human disease

Mark T W Ebbert¹,², Stefan L Farrugia³, Jonathon P Sens³,⁴, Karen Jansen-West³, Tania F Gendron³, Mercedes Prudencio³, Ian J McLaughlin⁵, Brett Bowman⁵, Matthew Seetin⁵, Mariely DeJesus-Hernandez³, Jazmyne Jackson³, Patricia H Brown³, Dennis W Dickson³, Marka van Blitterswijk³, Rosa Rademakers³, Leonard Petrucelli⁶,⁷, John D Fryer⁸,⁹

Affiliations

¹ Department of Neuroscience, Mayo Clinic, Jacksonville, FL, 32224, USA. ebbert.mark@mayo.edu.
² Mayo Graduate School, Mayo Clinic, Rochester, MN, 55905, USA. ebbert.mark@mayo.edu.
³ Department of Neuroscience, Mayo Clinic, Jacksonville, FL, 32224, USA.
⁴ Mayo Graduate School, Mayo Clinic, Rochester, MN, 55905, USA.
⁵ Pacific Biosciences, Menlo Park, CA, 94025, USA.
⁶ Department of Neuroscience, Mayo Clinic, Jacksonville, FL, 32224, USA. petrucelli.leonard@mayo.edu.
⁷ Mayo Graduate School, Mayo Clinic, Rochester, MN, 55905, USA. petrucelli.leonard@mayo.edu.
⁸ Department of Neuroscience, Mayo Clinic, Jacksonville, FL, 32224, USA. fryer.john@mayo.edu.
⁹ Mayo Graduate School, Mayo Clinic, Rochester, MN, 55905, USA. fryer.john@mayo.edu.

PMID: 30126445
PMCID: PMC6102925
DOI: 10.1186/s13024-018-0274-4

Mark T W Ebbert et al. Mol Neurodegener. 2018.

Abstract

Background: Many neurodegenerative diseases are caused by nucleotide repeat expansions, but most expansions, like the C9orf72 'GGGGCC' (G₄C₂) repeat that causes approximately 5-7% of all amyotrophic lateral sclerosis (ALS) and frontotemporal dementia (FTD) cases, are too long to sequence using short-read sequencing technologies. It is unclear whether long-read sequencing technologies can traverse these long, challenging repeat expansions. Here, we demonstrate that two long-read sequencing technologies, Pacific Biosciences' (PacBio) and Oxford Nanopore Technologies' (ONT), can sequence through disease-causing repeats cloned into plasmids, including the FTD/ALS-causing G₄C₂ repeat expansion. We also report the first long-read sequencing data characterizing the C9orf72 G₄C₂ repeat expansion at the nucleotide level in two symptomatic expansion carriers using PacBio whole-genome sequencing and a no-amplification (No-Amp) targeted approach based on CRISPR/Cas9.

Results: Both the PacBio and ONT platforms successfully sequenced through the repeat expansions in plasmids. Throughput on the MinION was a challenge for whole-genome sequencing; we were unable to attain reads covering the human C9orf72 repeat expansion using 15 flow cells. We obtained 8× coverage across the C9orf72 locus using the PacBio Sequel, accurately reporting the unexpanded allele at eight repeats, and reading through the entire expansion with 1324 repeats (7941 nucleotides). Using the No-Amp targeted approach, we attained > 800× coverage and were able to identify the unexpanded allele, closely estimate expansion size, and assess nucleotide content in a single experiment. We estimate the individual's repeat region was > 99% G₄C₂ content, though we cannot rule out small interruptions.
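As a rough illustration of how a figure such as "> 99% G₄C₂ content" could be computed for a single read, the sketch below measures the fraction of bases covered by exact, non-overlapping copies of the repeat unit. This is not the authors' pipeline: the function name `g4c2_fraction` and the exact-match criterion are assumptions made for illustration, and real long reads contain sequencing errors that would call for tolerant matching.

```python
import re


def g4c2_fraction(seq: str, unit: str = "GGGGCC") -> float:
    """Return the fraction of bases in `seq` covered by exact,
    non-overlapping copies of `unit` (hypothetical helper)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    # re.findall returns non-overlapping matches, scanning left to right.
    covered = len(re.findall(re.escape(unit), seq)) * len(unit)
    return covered / len(seq)


# Toy repeat region with a two-base interruption in the middle.
toy_region = "GGGGCC" * 5 + "AT" + "GGGGCC" * 5
print(f"{g4c2_fraction(toy_region):.3f}")
```

A small interruption lowers the fraction only slightly, which is consistent with the caveat that a high overall G₄C₂ content does not rule out short interruptions.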

Conclusions: Our findings indicate that long-read sequencing is well suited to characterizing known repeat expansions, and for discovering new disease-causing, disease-modifying, or risk-modifying repeat expansions that have gone undetected with conventional short-read sequencing. The PacBio No-Amp targeted approach may have future potential in clinical and genetic counseling environments. Larger and deeper long-read sequencing studies in C9orf72 expansion carriers will be important to determine heterogeneity and whether the repeats are interrupted by non-G₄C₂ content, potentially mitigating or modifying disease course or age of onset, as interruptions are known to do in other repeat-expansion disorders. These results have broad implications across all diseases where the genetic etiology remains unclear.

Keywords: Amyotrophic lateral sclerosis (ALS); C9orf72; Frontotemporal dementia (FTD); GGGGCC; Genetics; Long-read sequencing; Oxford Nanopore Technologies MinION; PacBio RS II and Sequel; Repeat expansion disorders; Structural mutations.

Conflict of interest statement

Ethics approval and consent to participate

The Mayo Clinic Institutional Review Board (IRB) approved all procedures for this study and we followed all appropriate protocols.

Consent for publication

All participants were properly consented for this study.

Competing interests

IM, BB, and MS are full-time employees of Pacific Biosciences of California, Inc. IM’s spouse is also a full-time employee of, and owns stock in, Pacific Biosciences of California, Inc. All other authors declare they have no conflicts of interest.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

**Fig. 1**
Schematic diagrams for plasmids used to test PacBio and ONT long-read sequencing technologies. To minimize biases when comparing the PacBio RS II and ONT MinION, we constructed four plasmids, including three repeat-containing plasmids and a non-repeat-containing plasmid. Each plasmid map identifies estimated plasmid size, and the location and size of the repeat within the plasmid. a The first plasmid did not contain a repeat, as a control, but instead included the *EGFP* gene. The *EGFP* plasmid was linearized at position 2969 with the AvrII restriction enzyme. b We also constructed a plasmid with 62 repeats of the spinocerebellar ataxia type 36 (SCA36) ‘GGCCTG’ repeat, which was linearized at position 2873 with AvrII. c A third plasmid contained 423 *C9orf72* ‘GGGGCC’ repeats, and was linearized at position 6368 with MluI to maximize non-repeat sequence both up and downstream of the plasmid, thus avoiding bias against reads in either direction; allowing the repeat to be too close to either end could compromise sequencing or downstream analyses. d We included an additional plasmid with 774 *C9orf72* ‘GGGGCC’ repeats to simulate the expansion size found in ALS- or FTD-affected expansion carriers. While 774 repeats is dramatically smaller than the expansion found in many affected carriers, it was the largest we were able to construct reliably, because these repeats are unstable in bacteria. Additionally, while we targeted the number of specified repeats for each plasmid, most colonies contained fewer than the targeted repeats because repeats are generally unstable in bacteria (Additional file 1: Figure S1). Thus, the targeted number of repeats serves as an estimated maximum number of repeats. Plasmids were visualized using AngularPlasmid

**Fig. 2**
Workflow for linearizing, pooling, and sequencing plasmids on the PacBio RS II and ONT MinION long-read platforms. Each plasmid was cut with the restriction enzyme identified in the respective plasmid maps (Fig. 1), and at the specified location. After linearizing each plasmid independently, the plasmids were pooled and cleaned. We then sequenced the same pool on the PacBio RS II and Oxford Nanopore Technologies’ (ONT) MinION using their respective library preparation protocols. After sequencing, reads from each plasmid were identified using BLAST and then aligned to their respective reference sequences using graphmap, as preparation for downstream comparisons

**Fig. 3**
Both the PacBio RS II and ONT MinION successfully sequence through repeats, but the RS II had more variable read lengths. After selecting only those reads that could be clearly identified for each plasmid (described in Fig. 1), there were 46,213, 67,339, 9012, and 11,535 PacBio RS II reads for *EGFP*, SCA36, C9-423, and C9-774, respectively. Likewise, there were 26,735, 39,059, 8276, and 8720 ONT MinION reads for the same respective plasmids. The PacBio RS II generally had more reads, but read length distributions are much tighter for the ONT MinION across all four plasmids, and more closely resemble expected read lengths. The median read length for each instrument is indicated by dashed lines, and the expected maximum read length is indicated by a solid gray line. Expected maximum read lengths for each plasmid were 6080 (*EGFP*), 5984 (SCA36), 8813 (C9-423), and 9731 (C9-774). Because these long repeat sequences are unstable in plasmids, however, most bacterial colonies contained fewer than the targeted number of repeats (Additional file 1: Figure S1). Thus we expect the read sizes to vary. The additional PacBio RS II read variability may be related to library preparation

**Fig. 4**
Repeat length distributions for the PacBio RS II and ONT MinION were highly concordant. Both platforms produced highly similar distributions for all plasmids, but the repeat lengths varied widely within each plasmid, as expected based on gel intensity curves (Additional file 1: Figure S1). The C9-423 repeat length distribution is more variable than even the C9-774, perhaps because the C9-774 plasmid backbone is more tolerant of the repeat. The median number of repeats for the PacBio RS II were 35, 148, and 395 for SCA36, C9-423, and C9-774, respectively, while median repeat lengths for the ONT MinION were 37, 172, and 406, respectively. The percentage of reads that extended through the SCA36, C9-423, and C9-774, repeats were approximately 95.9%, 66.8%, and 43.8% for the PacBio RS II, respectively, while 99.5%, 97.7%, and 83.5% of ONT MinION reads extended through, respectively
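The per-read repeat counts summarized in this caption could, in the simplest case, be obtained by finding the longest uninterrupted tandem run of the repeat unit in each read and then taking the median across reads. The sketch below does this with exact matching on toy sequences. The helper name `longest_tandem_run` and the toy reads are hypothetical, and the authors' actual counting procedure is not described here.

```python
import re
from statistics import median


def longest_tandem_run(read: str, unit: str) -> int:
    """Number of units in the longest exact tandem run of `unit` in `read`."""
    pattern = re.compile(f"(?:{re.escape(unit)})+")
    runs = [len(m.group(0)) // len(unit) for m in pattern.finditer(read.upper())]
    return max(runs, default=0)


# Toy reads: flanking sequence around 3, 5, and 4 copies of the SCA36 unit.
toy_reads = [
    "ACGT" + "GGCCTG" * 3 + "TTAA",
    "ACGT" + "GGCCTG" * 5 + "TTAA",
    "ACGT" + "GGCCTG" * 4 + "TTAA",
]
counts = [longest_tandem_run(r, "GGCCTG") for r in toy_reads]
print(counts, median(counts))
```

Exact matching will undercount repeats in error-prone long reads, so a real analysis would more plausibly measure the distance between aligned flanking sequences.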

**Fig. 5**
Characterization of the affected *C9orf72* repeat expansion carriers using standard methodologies. a, d We first performed fluorescent PCR to determine the individuals' non-pathogenic repeat sizes. Genomic DNA was PCR-amplified with genotyping primers and one fluorescently labeled primer. Fragment length analysis of the PCR product was then performed on an ABI3730 DNA analyzer and visualized using GeneMapper software. A peak is observable at 129 bp (a) and 165 bp (d), indicating that the non-pathogenic alleles for samples 1 and 2 contain two and eight repeats, respectively. A single peak also indicates that the individual is either homozygous for the given allele, or also has an expansion. b, e To determine whether the individuals had a repeat expansion, we performed a repeat-primed PCR analysis. PCR products of a repeat-primed PCR were separated on an ABI3730 DNA analyzer and visualized by GeneMapper software, showing a stutter amplification characteristic for a *C9orf72* repeat expansion. This does not indicate expansion size, however. c, f After determining the individuals were expansion carriers, we performed a Southern blot to estimate the size. The Southern blots reveal a long repeat expansion in other individuals for whom cerebellar tissue was available, including positive controls (POS CON; lanes 1–5, and 1 and 3, respectively) and our patients of interest (CASE; lanes six and two, respectively). DIG-labeled DNA Molecular Weight Markers (Roche) are shown to estimate the repeat expansion’s size. Measurements were based on multiple separate Southern blots for each case; for simplicity one representative Southern blot is shown. The most abundant expansion sizes in samples 1 and 2 are estimated at around 1083 (8.8 kb) and 1933 repeats (13.9 kb), respectively. The smears ranged widely, demonstrating the heterogeneity (i.e., mosaicism) of this repeat expansion within a small piece of tissue. This demonstrates the importance of additional long-read sequencing studies to characterize the repeat at the nucleotide level
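The size-to-repeat conversions quoted in this caption follow a simple linear relationship, sketched below. The flank lengths are not stated in the text: `PCR_FLANK_BP` (117 bp) is inferred from the 129 bp peak corresponding to two repeats, and `SOUTHERN_FLANK_BP` (about 2.3 kb) is inferred from 8.8 kb corresponding to 1083 repeats. Both constants and both function names are assumptions for illustration only.

```python
UNIT_NT = 6  # length of one GGGGCC unit

# Inferred, not stated in the source: 129 bp peak minus 2 repeats x 6 nt.
PCR_FLANK_BP = 129 - 2 * UNIT_NT  # 117 bp

# Inferred, not stated in the source: 8.8 kb minus 1083 repeats x 6 nt.
SOUTHERN_FLANK_BP = 2300


def repeats_from_pcr_fragment(fragment_bp: float) -> float:
    """Repeat count implied by a fluorescent PCR fragment length."""
    return (fragment_bp - PCR_FLANK_BP) / UNIT_NT


def repeats_from_southern_band(band_kb: float) -> float:
    """Repeat count implied by a Southern blot band size in kilobases."""
    return (band_kb * 1000 - SOUTHERN_FLANK_BP) / UNIT_NT


for bp in (129, 165):
    print(bp, "bp ->", repeats_from_pcr_fragment(bp), "repeats")
for kb in (8.8, 13.9):
    print(kb, "kb ->", round(repeats_from_southern_band(kb)), "repeats")
```

With these inferred flank lengths, the 165 bp peak maps to eight repeats and the 13.9 kb band to roughly 1933 repeats, matching the values reported in the caption.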

**Fig. 6**
PacBio Sequel reads traverse the repeat region for pathogenic and non-pathogenic alleles. The PacBio Sequel sequenced through both pathogenic and non-pathogenic alleles, demonstrating the platform is capable of characterizing repeat expansions. All of these reads were first aligned by graphmap, and then hand curated to determine the repeat region. a The human genome reference sequence (hg38) contains three G₄C₂ repeats (18 nucleotides). We identified specific “landmarks” before and after the repeat region in the reference sequence to properly locate the repeat region in the reads, and to hand curate the alignments. Landmarks are identified by red bars adjacent to the repeat region. b We obtained four PacBio Sequel reads covering the eight-repeat sequence, spanning 48 nucleotides. There was a net gain of 29 nucleotides within the defined repeat region, which equates to approximately 5 additional repeats; this concurs with our fragment analysis (Fig. 5d). c We also obtained four reads that covered an expanded allele, one of which bridged the entire repeat expansion, with approximately 1324 repeats (7941 nucleotides). The other three reads ended before bridging the repeat region, where one captured approximately 30 repeats (178 nucleotides), another captured approximately 69 repeats (419 nucleotides), and the third captured approximately 912 repeats (5471 nucleotides)
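The arithmetic behind panel b can be written out as a short sketch: the reference carries three repeats, and a net gain of 29 nucleotides within the landmark-delimited region corresponds to just under five extra units, for about eight repeats in total. The function names are hypothetical, and the repeat counts quoted in the caption are approximations, so this sketch reports unrounded values and leaves the rounding convention to the reader.

```python
UNIT_NT = 6            # length of one GGGGCC unit
REFERENCE_REPEATS = 3  # repeats in hg38, per the caption


def repeats_from_net_gain(net_gain_nt: int,
                          reference_repeats: int = REFERENCE_REPEATS) -> float:
    """Repeat count implied by a net insertion relative to the reference."""
    return reference_repeats + net_gain_nt / UNIT_NT


def repeats_from_length(region_nt: int) -> float:
    """Repeat count implied by the length of a read's repeat region."""
    return region_nt / UNIT_NT


print(f"{repeats_from_net_gain(29):.2f}")   # unexpanded allele, panel b
print(f"{repeats_from_length(7941):.1f}")   # read bridging the expansion, panel c
```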

**Fig. 7**
Whole-genome PacBio Sequel reads aligned to hg38. Whole-genome reads generated using the PacBio Sequel were aligned to human reference genome hg38 using graphmap. We attained 7× genome-wide median coverage and 8× across the *C9orf72* repeat locus. Four reads were from the individual’s wild-type allele of eight repeats, while the other four were expanded. Three of the four reads capturing an expanded allele did not bridge the entire repeat region, where one captured 178 nucleotides in the repeat region (approximately 30 repeats; red), another captured 419 nucleotides (approximately 69 repeats; blue), and the third captured 5471 nucleotides (approximately 912 repeats; green). The read capturing 419 nucleotides may have bridged the repeat because the end of the read closely matches the sequence adjacent to the repeat region, but was ambiguous (Additional file 1: Figure S3d). The final read spanned the entire repeat with 7941 nucleotides (approximately 1324 repeats; brown), which falls easily within the Southern blot’s range (Fig. 5f). Soft-clipped nucleotides (nucleotides at the end of a read that did not align to the reference) are shown for all reads, and are outlined in green for the read capturing 912 repeats. The approximate location for the repeat expansion is marked by the light-blue lines. A histogram showing read depth per nucleotide is included near the top of the figure. Alignments were visualized using the Integrative Genomics Viewer (IGV)

**Fig. 8**
Repeat length distribution using the PacBio no-amplification (No-Amp) targeted sequencing method. We sequenced sample 1 using the PacBio no-amplification (No-Amp) targeted sequencing method and obtained 828 circular consensus sequences. Approximately 70% (576 of 828) of reads covered the individual’s wild-type allele (two repeats), 14% (115 of 828) were within six nucleotides of two repeats, and 16% (134 of 828) were from expanded alleles. The repeat distribution from the expanded alleles shows two modes at approximately 110 and 870 repeats. Without prior estimates from the Southern blot (Fig. 5c), we likely would have estimated the primary populations of this individual’s repeats at 2, 110, and 870 repeats. Because of the Southern blot, however, we know the primary population of the individual’s expanded repeat is near 1000 repeats, though it is possible the *C9orf72* repeat expansion runs artificially high by Southern blot because of methylation or high GC content. The 97.5th and 99th percentiles of this distribution’s probability density function are approximately 964 and 1011 repeats, which closely resemble estimates by Southern blot (Fig. 5c)
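The read categories and percentile summaries described in this caption can be sketched as follows. The classification thresholds mirror the caption's wording (exactly two repeats, within six nucleotides of two repeats, or longer), but the function names, the nearest-rank percentile definition, and the treatment of all remaining reads as expanded are assumptions. The caption itself refers to percentiles of a fitted probability density function, which this sketch does not reproduce.

```python
import math

UNIT_NT = 6
WILD_TYPE_NT = 2 * UNIT_NT  # sample 1 carries two repeats on the unexpanded allele


def classify_read(region_nt: int) -> str:
    """Assign a read to one of the three categories used in the caption."""
    if region_nt == WILD_TYPE_NT:
        return "wild-type"
    if region_nt <= WILD_TYPE_NT + UNIT_NT:
        return "near wild-type"  # within six nucleotides of two repeats
    return "expanded"


def nearest_rank_percentile(values, p):
    """Nearest-rank percentile of `values` for 0 < p <= 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Read counts quoted in the caption, expressed as rounded percentages of 828.
counts = {"wild-type": 576, "near wild-type": 115, "expanded": 134}
percentages = {k: round(100 * v / 828) for k, v in counts.items()}
print(percentages)
```

Applied to per-read repeat counts from the expanded alleles, `nearest_rank_percentile(values, 97.5)` would give an empirical upper-tail estimate comparable in spirit to the values the caption reports.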

**Fig. 9**
Schematic of PacBio no-amplification (No-Amp) targeted sequencing. We applied the PacBio no-amplification (No-Amp) Targeted Sequencing method to a *C9orf72* G₄C₂ repeat expansion carrier to better characterize the repeat’s nucleotide content. The No-Amp targeted sequencing method begins with typical SMRTbell library preparation after the target region is excised by restriction enzyme digestion. Cas9 digestion follows with a guide RNA specific to sequence adjacent (Cas9 Cutting Site) to the region of interest (green), leaving the SMRTbells blunt ended. In this case, the guide RNA was specific to sequence upstream (5′) of the G₄C₂ repeat expansion on the anti-sense strand. A new capture adapter (red) is then ligated to the blunt ends and captured using magnetic beads (magbeads). This process enriches the library for reads containing the region of interest to maximize read depth
