[Preprint]. 2023 May 16:2023.05.15.540865.
doi: 10.1101/2023.05.15.540865.

The ENCODE4 long-read RNA-seq collection reveals distinct classes of transcript structure diversity

Fairlie Reese et al. bioRxiv.

Abstract

The majority of mammalian genes encode multiple transcript isoforms that result from differential promoter use, changes in exonic splicing, and alternative 3' end choice. Detecting and quantifying transcript isoforms across tissues, cell types, and species has been extremely challenging because transcripts are much longer than the short reads normally used for RNA-seq. By contrast, long-read RNA-seq (LR-RNA-seq) gives the complete structure of most transcripts. We sequenced 264 LR-RNA-seq PacBio libraries totaling over 1 billion circular consensus reads (CCS) for 81 unique human and mouse samples. We detect at least one full-length transcript from 87.7% of annotated human protein coding genes and a total of 200,000 full-length transcripts, 40% of which have novel exon junction chains. To capture and compute on the three sources of transcript structure diversity, we introduce a gene and transcript annotation framework that uses triplets representing the transcript start site, exon junction chain, and transcript end site of each transcript. Using triplets in a simplex representation demonstrates how promoter selection, splice pattern, and 3' processing are deployed across human tissues, with nearly half of multi-transcript protein coding genes showing a clear bias toward one of the three diversity mechanisms. Evaluated across samples, the predominantly expressed transcript changes for 74% of protein coding genes. In evolution, the human and mouse transcriptomes are globally similar in types of transcript structure diversity, yet among individual orthologous gene pairs, more than half (57.8%) show substantial differences in mechanism of diversification in matching tissues. This initial large-scale survey of human and mouse long-read transcriptomes provides a foundation for further analyses of alternative transcript usage, and is complemented by short-read and microRNA data on the same samples and by epigenome data elsewhere in the ENCODE4 collection.
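The triplet idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes each transcript has already been assigned integer identifiers for its transcript start site, exon junction chain, and transcript end site. The `GENE[tss,ec,tes]` label format, the gene name `GENEA`, and the function names are hypothetical. Counting the unique features per gene gives one summary triplet for that gene.

```python
from typing import NamedTuple


class Triplet(NamedTuple):
    """One transcript described by its three structural features."""
    tss: int  # transcript start site identifier (assumed pre-assigned)
    ec: int   # exon junction chain identifier (assumed pre-assigned)
    tes: int  # transcript end site identifier (assumed pre-assigned)


def triplet_label(gene: str, t: Triplet) -> str:
    """Build a readable transcript label; the format is illustrative only."""
    return f"{gene}[{t.tss},{t.ec},{t.tes}]"


def gene_triplet(transcripts: list[Triplet]) -> tuple[int, int, int]:
    """Count the unique TSSs, ECs, and TESs used by a gene's transcripts."""
    return (
        len({t.tss for t in transcripts}),
        len({t.ec for t in transcripts}),
        len({t.tes for t in transcripts}),
    )


# Toy gene with three transcripts: two start sites, three junction chains,
# and a single shared end site.
toy = [Triplet(1, 1, 1), Triplet(1, 2, 1), Triplet(2, 3, 1)]
labels = [triplet_label("GENEA", t) for t in toy]
counts = gene_triplet(toy)
print(labels)
print(counts)
```

Keeping the three features as separate identifiers means each source of diversity can be counted or compared on its own, without re-parsing transcript structures.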

Figures

Figure 1. Overview of the ENCODE4 RNA datasets.
a, Overview of the sampled tissues and number of libraries from each tissue in the ENCODE human LR-RNA-seq dataset. b, Percentage of GENCODE v40 polyA genes by gene biotype detected in at least one ENCODE short-read RNA-seq library from samples that match the LR-RNA-seq at > 0 TPM, >=1 TPM, and >= 100 TPM. c, Number of samples in which each GENCODE v40 gene is detected >= 1 TPM in the ENCODE short-read RNA-seq dataset from samples that match the LR-RNA-seq. d, Data processing pipeline for the LR-RNA-seq data. e, Percentage of GENCODE v40 polyA genes by gene biotype detected in at least one ENCODE human LR-RNA-seq library at > 0 TPM, >= 1 TPM, and >= 100 TPM. f, Number of samples in which each GENCODE v40 gene is detected >= 1 TPM in the ENCODE human LR-RNA-seq dataset. g, Boxplot of TPM of polyA genes at the indicated rank in each human LR-RNA-seq library. Not significant (no stars) P > 0.05; *P <= 0.05, **P <= 0.01, ***P <= 0.001, ****P <= 0.0001; Wilcoxon rank-sum test.
Figure 2. Triplet annotation of transcript structure maps diversity within and across samples.
a, Representation of structure and transcript triplet naming convention for 3 different transcripts from the same gene based on the transcript start site (TSS), exon junction chain (EC), and transcript end site (TES) used. b-d, Triplet features detected >= 1 TPM in human ENCODE LR-RNA-seq from GENCODE v40 polyA genes broken out by novelty and support. Known features are annotated in GENCODE v29 or v40. Novel supported features are supported by b, CAGE or RAMPAGE c, GTEx, d, PAS-seq or the PolyA Atlas. e-g, Triplet features detected >= 1 TPM in human ENCODE LR-RNA-seq per GENCODE v40 polyA gene split by gene biotype for e, TSSs, f, ECs, g, TESs. h, Number of transcripts from GENCODE v40 polyA genes detected >= 1 TPM from human ENCODE LR-RNA-seq that have a known EC split by gene biotype. i, Novelty characterization of triplet features in each transcript detected >= 1 TPM in the human ENCODE LR-RNA-seq. j, Number of transcripts detected >= 1 TPM in human ENCODE LR-RNA-seq per GENCODE v40 polyA gene split by gene biotype. k, COL1A1 (gene expressed at 548 TPM) transcripts expressed >= 1 TPM in the ovary sample from human ENCODE LR-RNA-seq. l, PKM (gene expressed at 506 TPM) transcripts expressed >= 1 TPM in the ovary sample from human ENCODE LR-RNA-seq colored by expression level (TPM). m, Expression level of gene (TPM) versus the percent isoform (pi) value of the predominant transcript for each gene expressed >= 1 TPM from human ENCODE LR-RNA-seq in the ovary sample. Points are colored by whether or not pi = 100. n, Number of unique predominant transcripts detected >= 1 TPM across samples per gene.
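The percent isoform (pi) value and the predominant transcript in panels m and n can be sketched as follows. The caption does not define either term, so this sketch assumes that pi is a transcript's share of its gene's total expression, expressed as a percentage, and that the predominant transcript is the one with the highest expression among those at or above the 1 TPM threshold. The transcript names and TPM values are made up for illustration.

```python
from typing import Optional


def percent_isoform(tpm_by_transcript: dict[str, float]) -> dict[str, float]:
    """Return each transcript's percentage of the gene's summed TPM.

    Assumption: the denominator is the sum over all transcripts passed in.
    """
    total = sum(tpm_by_transcript.values())
    if total == 0:
        return {name: 0.0 for name in tpm_by_transcript}
    return {name: 100.0 * tpm / total for name, tpm in tpm_by_transcript.items()}


def predominant_transcript(
    tpm_by_transcript: dict[str, float], min_tpm: float = 1.0
) -> Optional[tuple[str, float]]:
    """Return (name, pi) of the most expressed transcript at or above min_tpm."""
    expressed = {n: t for n, t in tpm_by_transcript.items() if t >= min_tpm}
    if not expressed:
        return None  # no transcript passes the expression threshold
    pi = percent_isoform(tpm_by_transcript)
    name = max(expressed, key=expressed.get)
    return name, pi[name]


# Hypothetical gene with three transcripts in one sample.
sample = {"tx1": 60.0, "tx2": 30.0, "tx3": 10.0}
result = predominant_transcript(sample)
print(result)
```

Running the same function per sample and collecting the distinct names returned would give the number of unique predominant transcripts per gene, as counted in panel n.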
Figure 3. The gene structure simplex represents distinct modes of transcript structure diversity across genes and samples.
a, Transcripts for 5 model genes; 1 of each sector (TSS-high, splicing-high, TES-high, mixed, and simple). Table shows the gene triplet, splicing ratio gene triplet, and simplex coordinates that correspond to each toy gene. b, Layout of the gene structure simplex with the genes from a, plotted based on their simplex coordinates. Proportion of TSS usage is the blue axis (left), proportion of TES is the orange axis (bottom), and proportion of splicing ratio is the pink axis (right). Regions of the simplex are colored and labeled based on their sector category (TSS-high, splicing-high, TES-high). Gene triplets that land in each sector are assigned the concordant sector category. c-e, Gene structure simplices for the transcripts from protein coding genes that are c, annotated in GENCODE v40 where the parent gene is also detected in our human LR-RNA-seq dataset, d, the observed set of transcripts, those detected >= 1 TPM in the human ENCODE LR-RNA-seq dataset, e, the observed major set of transcripts, the union of major transcripts from each sample detected >= 1 TPM in the human ENCODE LR-RNA-seq dataset. f-j, Proportion of genes from the GENCODE v40, observed, and observed major sets that fall into the f, TSS-high sector, g, splicing-high sector, h, TES-high sector, i, mixed sector, j, simple sector. k, Gene structure simplex for AKAP8L. Gene triplets with splicing ratio for H9 and H9-derived pancreatic progenitors labeled. Simplex coordinates for the GENCODE v40, observed set, and centroid of the samples also shown for AKAP8L. l-m, Transcripts of AKAP8L expressed >= 1 TPM in l, H9 m, H9-derived pancreatic progenitors colored by expression level in TPM. Alternative exons that differ between transcripts are colored pink.
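The mapping from a gene triplet to simplex coordinates and a sector can be sketched in code. The caption does not spell out the formulas, so the following rests on stated assumptions rather than the authors' exact method: the splicing ratio is taken as the number of exon junction chains divided by the mean of the TSS and TES counts; the coordinates are the TSS count, splicing ratio, and TES count normalized to sum to 1; a sector is assigned when one coordinate exceeds a hypothetical cutoff of 0.5; and a gene whose triplet is (1, 1, 1) is labeled simple.

```python
def simplex_coordinates(n_tss: int, n_ec: int, n_tes: int) -> tuple[float, float, float]:
    """Map feature counts to (TSS, splicing, TES) proportions that sum to 1."""
    if min(n_tss, n_ec, n_tes) < 1:
        raise ValueError("each feature count must be at least 1")
    # Assumed definition: junction chains relative to the mean number of ends.
    splicing_ratio = n_ec / ((n_tss + n_tes) / 2)
    total = n_tss + splicing_ratio + n_tes
    return (n_tss / total, splicing_ratio / total, n_tes / total)


def sector(n_tss: int, n_ec: int, n_tes: int, cutoff: float = 0.5) -> str:
    """Assign a sector label; the 0.5 cutoff is an illustrative assumption."""
    if (n_tss, n_ec, n_tes) == (1, 1, 1):
        return "simple"
    tss, splicing, tes = simplex_coordinates(n_tss, n_ec, n_tes)
    if tss > cutoff:
        return "TSS-high"
    if splicing > cutoff:
        return "splicing-high"
    if tes > cutoff:
        return "TES-high"
    return "mixed"


# Toy gene triplets, one intended for each sector.
toy_genes = {
    "tss_gene": (4, 4, 1),
    "splicing_gene": (1, 4, 1),
    "tes_gene": (1, 3, 4),
    "mixed_gene": (2, 2, 2),
    "simple_gene": (1, 1, 1),
}
sectors = {name: sector(*triplet) for name, triplet in toy_genes.items()}
print(sectors)
```

Because the three coordinates sum to 1, at most one can exceed 0.5, so the order of the checks in `sector` does not affect the result.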
Figure 4. Sample-specific and global changes in predominant and major transcript isoform usage.
a, Gene structure simplex for major transcripts of ELN. Gene triplets with splicing ratio for lung and H9-derived chondrocytes labeled. Simplex coordinates for the GENCODE v40 and observed major set are labeled. b, Major transcripts of ELN expressed >= 1 TPM in lung colored by expression level in TPM. Alternative exons that differ between transcripts are colored pink. c, Gene structure simplex for major transcripts of CTCF. Gene triplets with splicing ratio for lung labeled. Simplex coordinates for the GENCODE v40 and observed major set are labeled. d, From top to bottom: Major transcripts of CTCF expressed >= 1 TPM in lung, TSSs of CTCF major transcripts expressed >= 1 TPM in lung, ENCODE cCREs colored by type. e, Gene structure simplex for E4F1. Gene triplet with splicing ratio for observed E4F1 transcripts labeled. Simplex coordinates for the GENCODE v40 and observed set also shown for E4F1. f, Gene structure simplex for major transcripts of E4F1. Gene triplet with splicing ratio for observed major E4F1 transcripts labeled. Simplex coordinates for the GENCODE v40 and observed major set also shown for E4F1. g, Sector assignment change and conservation for protein coding genes in the human ENCODE LR-RNA-seq dataset between the observed set of gene triplets (left) and the observed major set of gene triplets (right). Percent of genes with the same sector between both sets labeled in the middle. h-k, Percentage of libraries where a gene with an annotated MANE transcript is expressed and the MANE h, transcript i, TSS j, EC k, TES is the predominant transcript or triplet feature.
Figure 5. Conservation of gene triplets from human and mouse.
a-e, Proportion of genes from the GENCODE vM25, observed, and observed major sets that fall into the a, TSS-high sector, b, splicing-high sector, c, TES-high sector, d, mixed sector, e, simple sector. f, Gene structure simplex for ARF4 in human. Gene triplet with splicing ratio for ARF4 transcripts in H1 labeled. Simplex coordinates for the GENCODE v40, sample-level centroid, and observed set also shown for ARF4. g, Gene structure simplex for Arf4 in mouse. Gene triplet with splicing ratio for Arf4 transcripts in F121–9 labeled. Simplex coordinates for the GENCODE v40, sample-level centroid, and observed set also shown for Arf4. h, Transcripts of ARF4 expressed >= 1 TPM in human H1 sample colored by expression level in TPM. i, Transcripts of Arf4 expressed >= 1 TPM in mouse F121–9 sample colored by expression level in TPM. j, Sector assignment change and conservation for orthologous protein coding genes between the observed major human set of gene triplets (left) and the observed major mouse set of gene triplets (right). Percent of genes with the same sector between both sets labeled in the middle. k, Sector assignment change and conservation for orthologous protein coding genes between the sample-level H1 major human set of gene triplets (left) and the sample-level F121–9 major mouse set of gene triplets (right). Percent of genes with the same sector between both sets labeled in the middle.

