This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2023 May 16:2023.05.15.540865.

doi: 10.1101/2023.05.15.540865.

The ENCODE4 long-read RNA-seq collection reveals distinct classes of transcript structure diversity

Fairlie Reese^{1,2}, Brian Williams³, Gabriela Balderrama-Gutierrez^{1,2}, Dana Wyman², Muhammed Hasan Çelik², Elisabeth Rebboah², Narges Rezaie², Diane Trout³, Milad Razavi-Mohseni^{4,5}, Yunzhe Jiang^{6,7}, Beatrice Borsari^{6,7,8}, Samuel Morabito², Heidi Yahan Liang², Cassandra J McGill^{1,2}, Sorena Rahmanian², Jasmine Sakr^{2,9}, Shan Jiang^{1,2}, Weihua Zeng^{1,2}, Klebea Carvalho², Annika K Weimer¹⁰, Louise A Dionne¹¹, Ariel McShane^{12,13}, Karan Bedi^{14,15}, Shaimae I Elhajjajy¹⁶, Sean Upchurch³, Jennifer Jou¹⁰, Ingrid Youngworth¹⁰, Idan Gabdank¹⁰, Paul Sud¹⁰, Otto Jolanki¹⁰, J Seth Strattan¹⁰, Meenakshi S Kagda¹⁰, Michael P Snyder¹⁰, Ben C Hitz¹⁰, Jill E Moore¹⁶, Zhiping Weng¹⁶, David Bennett^{17,18}, Laura Reinholdt¹¹, Mats Ljungman^{15,19}, Michael A Beer^{4,5}, Mark B Gerstein^{6,7,20,21,22}, Lior Pachter^{3,23}, Roderic Guigó^{8,24}, Barbara J Wold³, Ali Mortazavi^{1,2}

Affiliations

¹ Developmental and Cell Biology, University of California, Irvine, Irvine, USA.
² Center for Complex Biological Systems, University of California, Irvine, Irvine, USA.
³ Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, USA.
⁴ Department of Biomedical Engineering, Johns Hopkins University, Baltimore, USA.
⁵ McKusick-Nathans Department of Genetic Medicine, Johns Hopkins University, Baltimore, USA.
⁶ Program in Computational Biology and Bioinformatics, Yale University, New Haven, USA.
⁷ Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, USA.
⁸ Centre for Genomic Regulation, The Barcelona Institute of Science and Technology, Barcelona, Spain.
⁹ Department of Pharmaceutical Sciences, University of California, Irvine, Irvine, USA.
¹⁰ Department of Genetics, Stanford University School of Medicine, Palo Alto, USA.
¹¹ The Jackson Laboratory, Bar Harbor, USA.
¹² Cellular and Molecular Biology Program, University of Michigan, Ann Arbor, USA.
¹³ Department of Radiation Oncology, University of Michigan, Ann Arbor, USA.
¹⁴ Department of Biostatistics, University of Michigan, Ann Arbor, USA.
¹⁵ Center for RNA Biomedicine and Rogel Cancer Center, University of Michigan, Ann Arbor, USA.
¹⁶ Program in Bioinformatics and Integrative Biology, University of Massachusetts Chan Medical School, Worcester, USA.
¹⁷ Rush Alzheimer's Disease Center, Rush University Medical Center, Chicago, USA.
¹⁸ Department of Neurological Sciences, Rush University Medical Center, Chicago, USA.
¹⁹ Departments of Radiation Oncology and Environmental Health Sciences, University of Michigan, Ann Arbor, USA.
²⁰ Section on Biomedical Informatics and Data Science, Yale University, New Haven, USA.
²¹ Department of Statistics and Data Science, Yale University, New Haven, USA.
²² Department of Computer Science, Yale University, New Haven, USA.
²³ Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, USA.
²⁴ Department of Medicine and Life Sciences, Universitat Pompeu Fabra, Barcelona, Spain.

PMID: 37292896
PMCID: PMC10245583
DOI: 10.1101/2023.05.15.540865

Fairlie Reese et al. bioRxiv. 2023.

Abstract

The majority of mammalian genes encode multiple transcript isoforms that result from differential promoter use, changes in exonic splicing, and alternative 3' end choice. Detecting and quantifying transcript isoforms across tissues, cell types, and species has been extremely challenging because transcripts are much longer than the short reads normally used for RNA-seq. By contrast, long-read RNA-seq (LR-RNA-seq) gives the complete structure of most transcripts. We sequenced 264 LR-RNA-seq PacBio libraries totaling over 1 billion circular consensus reads (CCS) for 81 unique human and mouse samples. We detect at least one full-length transcript from 87.7% of annotated human protein coding genes and a total of 200,000 full-length transcripts, 40% of which have novel exon junction chains. To capture and compute on the three sources of transcript structure diversity, we introduce a gene and transcript annotation framework that uses triplets representing the transcript start site, exon junction chain, and transcript end site of each transcript. Using triplets in a simplex representation demonstrates how promoter selection, splice pattern, and 3' processing are deployed across human tissues, with nearly half of multi-transcript protein coding genes showing a clear bias toward one of the three diversity mechanisms. Evaluated across samples, the predominantly expressed transcript changes for 74% of protein coding genes. In evolution, the human and mouse transcriptomes are globally similar in types of transcript structure diversity, yet among individual orthologous gene pairs, more than half (57.8%) show substantial differences in mechanism of diversification in matching tissues. This initial large-scale survey of human and mouse long-read transcriptomes provides a foundation for further analyses of alternative transcript usage, and is complemented by short-read and microRNA data on the same samples and by epigenome data elsewhere in the ENCODE4 collection.

Figures

**Figure 1. Overview of the ENCODE4 RNA datasets.**
a, Overview of the sampled tissues and number of libraries from each tissue in the ENCODE human LR-RNA-seq dataset. b, Percentage of GENCODE v40 polyA genes by gene biotype detected in at least one ENCODE short-read RNA-seq library from samples that match the LR-RNA-seq at > 0 TPM, >=1 TPM, and >= 100 TPM. c, Number of samples in which each GENCODE v40 gene is detected >= 1 TPM in the ENCODE short-read RNA-seq dataset from samples that match the LR-RNA-seq. d, Data processing pipeline for the LR-RNA-seq data. e, Percentage of GENCODE v40 polyA genes by gene biotype detected in at least one ENCODE human LR-RNA-seq library at > 0 TPM, >= 1 TPM, and >= 100 TPM. f, Number of samples in which each GENCODE v40 gene is detected >= 1 TPM in the ENCODE human LR-RNA-seq dataset. g, Boxplot of TPM of polyA genes at the indicated rank in each human LR-RNA-seq library. Not significant (no stars) P > 0.05; *P <= 0.05, **P <= 0.01, ***P <= 0.001, ****P <= 0.0001; Wilcoxon rank-sum test.

**Figure 2. Triplet annotation of transcript structure maps diversity within and across samples.**
a, Representation of structure and transcript triplet naming convention for 3 different transcripts from the same gene based on the transcript start site (TSS), exon junction chain (EC), and transcript end site (TES) used. **b-d,** Triplet features detected >= 1 TPM in human ENCODE LR-RNA-seq from GENCODE v40 polyA genes broken out by novelty and support. Known features are annotated in GENCODE v29 or v40. Novel supported features are supported by b, CAGE or RAMPAGE c, GTEx, d, PAS-seq or the PolyA Atlas. **e-g,** Triplet features detected >= 1 TPM in human ENCODE LR-RNA-seq per GENCODE v40 polyA gene split by gene biotype for e, TSSs, f, ECs, g, TESs. h, Number of transcripts from GENCODE v40 polyA genes detected >= 1 TPM from human ENCODE LR-RNA-seq that have a known EC split by gene biotype. i, Novelty characterization of triplet features in each transcript detected >= 1 TPM in the human ENCODE LR-RNA-seq. j, Number of transcripts detected >= 1 TPM in human ENCODE LR-RNA-seq per GENCODE v40 polyA gene split by gene biotype. k, *COL1A1* (gene expressed at 548 TPM) transcripts expressed >= 1 TPM in the ovary sample from human ENCODE LR-RNA-seq. l, *PKM* (gene expressed at 506 TPM) transcripts expressed >= 1 TPM in the ovary sample from human ENCODE LR-RNA-seq colored by expression level (TPM). m, Expression level of gene (TPM) versus the percent isoform (pi) value of the predominant transcript for each gene expressed >= 1 TPM from human ENCODE LR-RNA-seq in the ovary sample. Points are colored by whether or not pi = 100. n, Number of unique predominant transcripts detected >= 1 TPM across samples per gene.
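
The percent isoform (pi) value and the predominant transcript used in panels m and n can be illustrated with a minimal sketch. It assumes that pi is a transcript's TPM as a percentage of the summed TPM of all transcripts from the same gene, and that the predominant transcript is the one with the highest expression among those at or above 1 TPM; the caption does not spell out either definition, and the function and transcript names here are hypothetical.

```python
from typing import Dict, Optional


def percent_isoform(transcript_tpm: Dict[str, float]) -> Dict[str, float]:
    """Assumed definition: pi = 100 * transcript TPM / summed TPM of the gene."""
    gene_tpm = sum(transcript_tpm.values())
    if gene_tpm == 0:
        return {tx: 0.0 for tx in transcript_tpm}
    return {tx: 100.0 * tpm / gene_tpm for tx, tpm in transcript_tpm.items()}


def predominant_transcript(
    transcript_tpm: Dict[str, float], min_tpm: float = 1.0
) -> Optional[str]:
    """Return the most highly expressed transcript at or above min_tpm, if any."""
    expressed = {tx: tpm for tx, tpm in transcript_tpm.items() if tpm >= min_tpm}
    if not expressed:
        return None
    return max(expressed, key=expressed.get)


def n_unique_predominant(samples: Dict[str, Dict[str, float]]) -> int:
    """Count distinct predominant transcripts of one gene across samples."""
    winners = {predominant_transcript(tpms) for tpms in samples.values()}
    winners.discard(None)  # samples where the gene is not expressed
    return len(winners)


# Hypothetical expression values for one gene in two samples.
gene_by_sample = {
    "sample_1": {"tx_a": 30.0, "tx_b": 10.0},
    "sample_2": {"tx_a": 2.0, "tx_b": 18.0},
}
pi_sample_1 = percent_isoform(gene_by_sample["sample_1"])
n_predominant = n_unique_predominant(gene_by_sample)
```

A gene whose predominant transcript differs between samples, as in this toy input, would count toward the 74% of protein coding genes reported in the abstract under these assumptions.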

**Figure 3. The gene structure simplex represents distinct modes of transcript structure diversity across genes and samples.**
a, Transcripts for 5 model genes; 1 of each sector (TSS-high, splicing-high, TES-high, mixed, and simple). Table shows the gene triplet, splicing ratio gene triplet, and simplex coordinates that correspond to each toy gene. b, Layout of the gene structure simplex with the genes from a, plotted based on their simplex coordinates. Proportion of TSS usage is the blue axis (left), proportion of TES is the orange axis (bottom), and proportion of splicing ratio is the pink axis (right). Regions of the simplex are colored and labeled based on their sector category (TSS-high, splicing-high, TES-high). Gene triplets that land in each sector are assigned the concordant sector category. **c-e,** Gene structure simplices for the transcripts from protein coding genes that are c, annotated in GENCODE v40 where the parent gene is also detected in our human LR-RNA-seq dataset, d, the observed set of transcripts, those detected >= 1 TPM in the human ENCODE LR-RNA-seq dataset, e, the observed major set of transcripts, the union of major transcripts from each sample detected >= 1 TPM in the human ENCODE LR-RNA-seq dataset. **f-j,** Proportion of genes from the GENCODE v40, observed, and observed major sets that fall into the f, TSS-high sector, g, splicing-high sector, h, TES-high sector, i, mixed sector, j, simple sector. k, Gene structure simplex for *AKAP8L*. Gene triplets with splicing ratio for H9 and H9-derived pancreatic progenitors labeled. Simplex coordinates for the GENCODE v40, observed set, and centroid of the samples also shown for *AKAP8L*. **l-m,** Transcripts of *AKAP8L* expressed >= 1 TPM in l, H9 m, H9-derived pancreatic progenitors colored by expression level in TPM. Alternative exons that differ between transcripts are colored pink.
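
The mapping from a gene triplet to simplex coordinates and a sector can be sketched as follows. The formulas are assumptions, since this caption does not state them: the splicing ratio is taken as the number of exon junction chains divided by the mean of the TSS and TES counts, the coordinates are the three values normalized to sum to 1, and a sector is called "high" when its coordinate exceeds 0.5. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class GeneTriplet:
    """Counts of distinct TSSs, exon junction chains (ECs), and TESs for a gene."""
    n_tss: int
    n_ec: int
    n_tes: int


def splicing_ratio(t: GeneTriplet) -> float:
    # Assumed definition: ECs relative to the mean number of transcript ends.
    return 2.0 * t.n_ec / (t.n_tss + t.n_tes)


def simplex_coords(t: GeneTriplet) -> Tuple[float, float, float]:
    """Return (TSS, splicing, TES) proportions that sum to 1."""
    sr = splicing_ratio(t)
    total = t.n_tss + sr + t.n_tes
    return (t.n_tss / total, sr / total, t.n_tes / total)


def sector(t: GeneTriplet, threshold: float = 0.5) -> str:
    """Assign a sector; the 0.5 threshold is an assumption of this sketch."""
    if (t.n_tss, t.n_ec, t.n_tes) == (1, 1, 1):
        return "simple"  # a single transcript
    tss, spl, tes = simplex_coords(t)
    if tss > threshold:
        return "TSS-high"
    if spl > threshold:
        return "splicing-high"
    if tes > threshold:
        return "TES-high"
    return "mixed"


# Toy genes, one per sector, in the spirit of panel a (values are invented).
toy_genes = {
    "tss_gene": GeneTriplet(4, 4, 1),
    "splicing_gene": GeneTriplet(1, 4, 1),
    "tes_gene": GeneTriplet(1, 4, 4),
    "mixed_gene": GeneTriplet(2, 2, 2),
    "simple_gene": GeneTriplet(1, 1, 1),
}
sectors = {name: sector(t) for name, t in toy_genes.items()}
```

Because at most one coordinate can exceed 0.5 when the three sum to 1, the order of the checks in `sector` does not affect the result.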

**Figure 4. Sample-specific and global changes in predominant and major transcript isoform usage.**
a, Gene structure simplex for major transcripts of *ELN*. Gene triplets with splicing ratio for lung and H9-derived chondrocytes labeled. Simplex coordinates for the GENCODE v40 and observed major set are labeled. b, Major transcripts of *ELN* expressed >= 1 TPM in lung colored by expression level in TPM. Alternative exons that differ between transcripts are colored pink. c, Gene structure simplex for major transcripts of *CTCF*. Gene triplets with splicing ratio for lung labeled. Simplex coordinates for the GENCODE v40 and observed major set are labeled. d, From top to bottom: Major transcripts of *CTCF* expressed >= 1 TPM in lung, TSSs of *CTCF* major transcripts expressed >= 1 TPM in lung, ENCODE cCREs colored by type. e, Gene structure simplex for *E4F1*. Gene triplet with splicing ratio for observed E4F1 transcripts labeled. Simplex coordinates for the GENCODE v40 and observed set also shown for *E4F1*. f, Gene structure simplex for major transcripts of *E4F1*. Gene triplet with splicing ratio for observed major *E4F1* transcripts labeled. Simplex coordinates for the GENCODE v40 and observed major set also shown for *E4F1*. g, Sector assignment change and conservation for protein coding genes in the human ENCODE LR-RNA-seq dataset between the observed set of gene triplets (left) and the observed major set of gene triplets (right). Percent of genes with the same sector between both sets labeled in the middle. **h-k,** Percentage of libraries where a gene with an annotated MANE transcript is expressed and the MANE h, transcript i, TSS j, EC k, TES is the predominant transcript or triplet feature.

**Figure 5. Conservation of gene triplets from human and mouse.**
**a-e,** Proportion of genes from the GENCODE vM25, observed, and observed major sets that fall into the a, TSS-high sector, b, splicing-high sector, c, TES-high sector, d, mixed sector, e, simple sector. f, Gene structure simplex for *ARF4* in human. Gene triplet with splicing ratio for *ARF4* transcripts in H1 labeled. Simplex coordinates for the GENCODE v40, sample-level centroid, and observed set also shown for *ARF4*. g, Gene structure simplex for *Arf4* in mouse. Gene triplet with splicing ratio for *Arf4* transcripts in F121–9 labeled. Simplex coordinates for the GENCODE v40, sample-level centroid, and observed set also shown for *Arf4*. h, Transcripts of *ARF4* expressed >= 1 TPM in human H1 sample colored by expression level in TPM. i, Transcripts of *Arf4* expressed >= 1 TPM in mouse F121–9 sample colored by expression level in TPM. j, Sector assignment change and conservation for orthologous protein coding genes between the observed major human set of gene triplets (left) and the observed major mouse set of gene triplets (right). Percent of genes with the same sector between both sets labeled in the middle. k, Sector assignment change and conservation for orthologous protein coding genes between the sample-level H1 major human set of gene triplets (left) and the sample-level F121–9 major mouse set of gene triplets (right). Percent of genes with the same sector between both sets labeled in the middle.
