Genome Res. 2020 Jun;30(6):885-897.

doi: 10.1101/gr.259903.119. Epub 2020 Jul 6.

A long-read RNA-seq approach to identify novel transcripts of very large genes

Prech Uapinyoying^{1,2,3}, Jeremy Goecks⁴, Susan M Knoblach^{1,2}, Karuna Panchapakesan¹, Carsten G Bonnemann^{1,3}, Terence A Partridge^{1,2}, Jyoti K Jaiswal^{1,2}, Eric P Hoffman^{1,5}

Affiliations

¹ Center for Genetic Medicine Research, Children's Research Institute, Children's National Health System, Washington, D.C. 20010, USA.
² Department of Genomics and Precision Medicine, The George Washington University School of Medicine and Health Sciences, Washington, D.C. 20052, USA.
³ Neuromuscular and Neurogenetic Disorders of Childhood Section, National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, Maryland 20892, USA.
⁴ Computational Biology Program, Oregon Health and Science University, Portland, Oregon 97239, USA.
⁵ Department of Pharmaceutical Sciences, School of Pharmacy and Pharmaceutical Sciences, Binghamton University, Binghamton, New York 13902, USA.

PMID: 32660935
PMCID: PMC7370890
DOI: 10.1101/gr.259903.119

Prech Uapinyoying et al. Genome Res. 2020 Jun.

Abstract

RNA-seq is widely used for studying gene expression, but commonly used sequencing platforms produce short reads that span at most two exon junctions per read. This makes it difficult to accurately determine the composition and phasing of exons within transcripts. Although long-read sequencing alleviates this problem, it is not amenable to precise quantitation, which limits its utility for differential expression studies. We used long-read isoform sequencing combined with a novel analysis approach to compare alternative splicing of large, repetitive structural genes in muscles. Analysis of muscle structural genes that produce medium (Nrap: 5 kb), large (Neb: 22 kb), and very large (Ttn: 106 kb) transcripts in cardiac muscle and in fast and slow skeletal muscles identified unannotated exons for each of these ubiquitous muscle genes. The analysis also identified differential exon usage and phasing for these genes between the different muscle types. By mapping the in-phase transcript structures to known annotations, we also identified and quantified previously unannotated transcripts. Results were confirmed by endpoint PCR and Sanger sequencing, which revealed muscle-type-specific differential expression of these novel transcripts. The improved transcript identification and quantification shown by our approach removes previous impediments to studies aimed at quantitative differential expression of ultralong transcripts.

Figures

**Figure 1.**
Experimental design. (A) Total RNA was extracted from cardiac apex, extensor digitorum longus (EDL), and soleus muscles, and cDNA was generated using barcoded oligo(dT) primers. Next, 5–10 kb cDNAs from all muscles were size selected, pooled for library construction, and sequenced on the PacBio RSII. (B) PacBio's Iso-Seq bioinformatics pipeline. Raw reads are converted into circular consensus (CCS/HiFi) reads. HiFi reads are separated into full-length (FL) and non-FL reads. FL reads must have 5′ and 3′ primers and a poly(A) tail. FL reads are grouped by similarity (isoform), polished using non-FL reads to generate high-quality transcript consensus reads, and aligned to the mouse genome. (Reprinted, with permission, from Pacific Biosciences.)

**Figure 2.**
Sashimi plot showing extensive alternative splicing of EDL and soleus transcripts between exons 152 and 137 of nebulin. Splice junctions that skip known exons are highlighted. Three exons between exons 147 and 148 are not annotated in RefSeq. Two of these exons (red) are in GENCODE. Exon u-002 (green) is not annotated in either database. The plot shows consensus reads (not full-length reads). Minimum splice junction coverage was set to 5 for visual clarity.

**Figure 3.**
Differential usage of *Nrap* exon 12 (exonic part 038). (A) *Nrap* exon coverage graph (cropped) produced by exCOVator. (*Bottom*) Stacked bar graph showing total full-length (FL) read coverage of all exonic parts (EP) for the gene. (*Top*) Line graph displaying the ratio of [FL reads matching the EP/total FL reads] that overlap the EP coordinates. (B) Sashimi plot from the Integrative Genomics Viewer (IGV) displaying differential splicing of exon 12. The plot displays consensus reads from the BAM file (not FL reads). Minimum splice junction coverage = 5. (C) Agarose gel showing RT-PCR products pertaining to exon 12 from soleus, EDL, and heart. Primers target exons 9–14 (549 bp includes and 444 bp excludes exon 12). (D) Sanger sequencing of products excised from the gel in C. The *top* half shows sequences aligned using the IGV BLAT tool. Cardiac band 1 and EDL and soleus band 2 are missing exon 12.

**Figure 4.**
Differential usage of exon 138 of mouse nebulin. (A) Nebulin exon coverage graph (cropped) produced by exCOVator. (*Bottom*) Stacked bar graph displays full-length (FL) read coverage of exonic parts (EP). (*Top*) Line graph displays the ratio of [FL reads matching the EP/total FL reads] that overlap the EP coordinates. (B) Sashimi plot from the Integrative Genomics Viewer (IGV) displaying differential splicing of exon 138. The plot displays consensus reads from the BAM file (not FL reads). Minimum splice junction coverage = 5. (C) Agarose gel showing RT-PCR products pertaining to exon 138 from soleus, EDL, and heart. Primers target exons 135 and 139. Cardiac muscle shows a banding pattern similar to that of soleus; however, few reads were detected during sequencing. (D) Sanger sequencing of products cut from the gel seen in C. The *top* portion shows sequences aligned in IGV using the BLAT tool. Only the EDL band 2 is missing exon 138.

**Figure 5.**
Titin exon 191, an undocumented cassette exon removed from a subset of cardiac transcripts. (A) Titin exon coverage graph (cropped) produced by exCOVator. (*Top*) Line graph; the red arrow points to exonic part (EP) 129 or exon 191 of NM_011652 (N2-A). (*Bottom*) Stacked bar graph displays full-length (FL) read coverage of EP. (B) Sashimi plot showing the same data in the Integrative Genomics Viewer (IGV). The plot shows consensus reads from the BAM file (not FL reads). Minimum splice junction coverage = 5. (C) Agarose gel of RT-PCR products from soleus, EDL, and heart. Primers target exons 189 and 194 with predicted product sizes of 882 bp (+exon 191, C1) and 615 bp (no exon 191, C3). All soleus and EDL transcripts include exon 191. Cardiac has both predicted bands and one unknown band in the middle (C2) that is likely a heteroduplex. (D) Sanger sequencing data of products from C displayed using the IGV BLAT tool. Exon 191 is missing only from cardiac band 3 (C3) and is present in all other tissues.

**Figure 6.**
Identifying and quantifying *Nrap*, titin, and nebulin transcript isoforms using exPhaser. Exons highlighted as red squares were validated using RT-PCR and Sanger sequencing. (A) *Nrap* transcript structures. Exon numbering based on NM_008733. (B) Titin transcript structures. Exon numbering based on NM_011652. The *bottom* splice pattern could not be uniquely mapped to a single annotation. (C) Nebulin transcript structures. Exon numbering based on NM_010889. Some neighboring constitutively expressed exons were also included for clarity.

**Figure 7.**
Phasing and quantifying additional titin transcript structures using exPhaser. Exons are numbered by NM_011652 (N2-A) unless noted. (A) Focused phasing of skeletal muscle N2-A isoform defining exons (47, 48, 49, 50) and one cardiac N2-B exon (45*). (B) Focused phasing of cardiac N2-B isoform defining exons (45, 46, 168, 169) and two N2-A defining exons (47 and 167). (C) Phasing of exon 312, a cassette exon only spliced out of skeletal muscle transcripts. (D) Phasing of exon 45^‡ of ENSMUST00000099980, an alternate 3′ terminal exon. (E) Phasing of exons 11–13. The most abundant isoforms are unannotated. Transcript structures listed with multiple accession numbers cannot be uniquely assigned using the exons provided.
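The captions above describe two quantities: the exon-usage ratio plotted in Figures 3–5 ([FL reads matching the EP/total FL reads] that overlap the EP coordinates) and the per-read exon phasing tallied in Figures 6 and 7. The following is a minimal sketch of how such values could be computed from aligned full-length reads. It is not the exCOVator or exPhaser implementation. The function names, the representation of a read as a sorted list of half-open aligned blocks, and the choice to count a read in the denominator only when its alignment fully spans the exonic part are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# Assumed representation (hypothetical, for illustration only):
# an interval is half-open [start, end); a read is its aligned blocks sorted by start.
Interval = Tuple[int, int]
Read = List[Interval]


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def spans(read: Read, region: Interval) -> bool:
    """True if the read's alignment (first block start to last block end) covers the region."""
    return read[0][0] <= region[0] and region[1] <= read[-1][1]


def usage_ratio(reads: List[Read], exonic_part: Interval) -> Optional[float]:
    """Fraction of spanning reads that have an aligned block on the exonic part.

    Reads that splice over the exonic part count in the denominator only.
    Returns None when no read spans the exonic part.
    """
    spanning = [r for r in reads if spans(r, exonic_part)]
    if not spanning:
        return None
    matching = sum(any(overlaps(block, exonic_part) for block in r) for r in spanning)
    return matching / len(spanning)


def phase_counts(reads: List[Read], exons: Dict[str, Interval]) -> Counter:
    """Tally which combination of the given exons each read includes.

    Only reads spanning every listed exon are counted, so each pattern
    reflects exons observed together on a single molecule.
    """
    counts: Counter = Counter()
    for r in reads:
        if all(spans(r, iv) for iv in exons.values()):
            pattern = tuple(
                name for name, iv in exons.items()
                if any(overlaps(block, iv) for block in r)
            )
            counts[pattern] += 1
    return counts


if __name__ == "__main__":
    # Toy data: three exons A, B, C; B is a cassette exon skipped by one read.
    exons = {"A": (100, 200), "B": (300, 400), "C": (500, 600)}
    reads = [
        [(100, 200), (300, 400), (500, 600)],  # includes B
        [(100, 200), (500, 600)],              # skips B
        [(100, 200), (300, 400), (500, 600)],  # includes B
        [(450, 600)],                          # does not span B; ignored
    ]
    print(usage_ratio(reads, exons["B"]))
    print(phase_counts(reads, exons))
```

Requiring a read to span every exon of interest before counting it is what separates phasing from per-exon coverage: the tally then reports exon combinations seen on the same transcript rather than independent inclusion rates.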
