Nat Commun. 2020 Oct 13;11(1):5148.
doi: 10.1038/s41467-020-18976-7.

Single-cell RNA cap and tail sequencing (scRCAT-seq) reveals subtype-specific isoforms differing in transcript demarcation

Youjin Hu et al. Nat Commun.

Abstract

Differences in transcription start sites (TSS) and transcription end sites (TES) among gene isoforms can affect the stability, localization, and translation efficiency of mRNA. Gene isoforms allow a single gene to serve diverse functions across different cell types, and isoform dynamics allow those functions to differ over time. However, methods to efficiently identify and quantify RNA isoforms genome-wide in single cells are still lacking. Here, we introduce single-cell RNA Cap And Tail sequencing (scRCAT-seq), a method to demarcate the boundaries of isoforms based on short-read sequencing, with higher efficiency and lower cost than existing long-read sequencing methods. In conjunction with machine learning algorithms, scRCAT-seq demarcates RNA transcripts with unprecedented accuracy. We identified hundreds of previously uncharacterized transcripts and thousands of alternative transcripts for known genes, revealed cell-type-specific isoforms for various cell types across different species, and generated a cell atlas of isoform dynamics during the development of retinal cones.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Overview of scRCAT-seq.
a Schematic of the scRCAT-seq method. Full-length cDNA was synthesized by template-switching reverse transcription, amplified by PCR, and tagmented with Tn5 transposases. The TAG added to both ends contains the UMI (unique molecular identifier) and CI (cell identifier). Both 5′ and 3′ ends of the cDNA were captured and amplified by PCR, producing indexed libraries for pooled sequencing. Sequencing data were processed and transcription start sites (TSSs) and transcription end sites (TESs) were identified using machine learning models. CS1: common sequence 1; CS2: common sequence 2; TSO: Template-switching oligo; T30: 30 repeating T bases. b Schematic of the machine learning models. Features were collected based on characteristics related to the peaks, including the read distribution, motifs related to real TSSs/TESs, and sequence features related to internal false-positive signals, and used to train RF, LR, SVM, and KNN models. c Gene body coverage of scRCAT-seq reads derived from DRG (n = 18). Shown is the mean coverage of reads shaded by 95% confidence intervals. d Accuracy in identifying authentic TSSs and TESs with different machine learning models. Error bars represent standard deviation of the mean (n = 3). e Distance of the identified TSSs/TESs to those annotated in hg38. TSSs/TESs were identified from the scRCAT-seq peaks derived from hESC with the RF model. f Pie chart illustrating the distribution of the identified TSSs in hESC relative to the TSSs in the FANTOM5 database. The total number of TSS peaks identified after optimization by the machine learning models is indicated under the pie chart. g Pie chart illustrating the distribution of the identified TESs in hESC relative to the TESs in PolyA_DB3. Source data are provided as a Source data file.
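As an illustration of panel b, the following is a minimal sketch of one of the four model families named there (KNN), written in plain Python. It is not the authors' implementation: the two features, their values, and the labels are hypothetical and chosen only to show how peak-level features could separate authentic ends from internal false-positive signals.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training peaks.

    `train` is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical, made-up peak features (NOT taken from the paper):
#   feature 0: fraction of the peak's reads that fall on its summit
#   feature 1: 1.0 if a nearby sequence feature hints at internal priming, else 0.0
train = [
    ((0.90, 0.0), "authentic"),
    ((0.80, 0.0), "authentic"),
    ((0.85, 0.0), "authentic"),
    ((0.20, 1.0), "false_positive"),
    ((0.30, 1.0), "false_positive"),
    ((0.25, 1.0), "false_positive"),
]

label_sharp = knn_predict(train, (0.88, 0.0))     # sharp peak, no priming hint
label_internal = knn_predict(train, (0.22, 1.0))  # diffuse peak with priming hint
```

In practice the features would be put on comparable scales before computing distances, since Euclidean distance is dominated by whichever feature has the largest range.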
Fig. 2. Identification of novel transcripts and isoforms in single cells.
a The number of transcripts with both ends captured using scRCAT-seq (n = 34), Smart-seq2 (n = 12), or ScISOr-seq (n = 8), versus cost. Shown is the mean number of transcripts shaded by 95% confidence intervals. b Comparison between scRCAT-seq (n = 10) and Smart-seq2 (n = 10) in terms of the ratio of reads covering the 5′ end of transcripts (5-bp range to the end). Significance was computed using two-sided Wilcoxon test. The boxplot shows the median as the center line, the interquartile range (IQR) as the box, whiskers extending to 1.5 × IQR, and outliers as points. c The cost of scRCAT-seq (n = 18) and ScISOr-seq (n = 8) for detection of 1000 transcripts. Significance was computed using two-sided Wilcoxon test. The boxplot shows the median as the center line, the interquartile range (IQR) as the box, whiskers extending to 1.5 × IQR, and outliers as points. d Violin plots comparing the expression level between genes detected by scRCAT-seq (n = 3) and ScISOr-seq (n = 3). Gene expression levels were quantified by Smart-seq2 RPM value. Significance was computed using two-sided Wilcoxon test. e Barplot showing the number of novel isoforms of annotated genes and novel, unannotated transcripts in mouse oocytes. The number of transcripts for each category is indicated above the box. Error bars represent standard deviation of the mean (n = 3). f Barplot showing the number of novel isoforms of annotated genes and novel, unannotated transcripts in mouse DRG. Error bars represent standard deviation of the mean (n = 3). g Venn diagram for novel transcripts detected concordantly by scRCAT-seq, Smart-seq2, and ScISOr-seq. h Genome browser track for an example of a novel gene with alternative polyadenylation sites on a different exon. i Gel image showing validation result of novel gene in (h). Experiments were repeated three times with similar results. Source data are provided as a Source Data file.
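The boxplot convention described for panels b and c (median, IQR box, whiskers at 1.5 × IQR, outliers as points) can be sketched as follows. This assumes the usual Tukey convention, in which each whisker ends at the most extreme data point still inside the 1.5 × IQR fence; the caption does not say which quartile method was used, so the `inclusive` method here is an assumption, and the input values are made up.

```python
import statistics

def boxplot_stats(values):
    """Return the summary values a Tukey-style boxplot would draw."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme observations inside the fences.
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }

# Made-up values with one obvious outlier.
stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```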
Fig. 3. Quantification of RNA isoforms with alternative TSSs and TESs.
a Scatterplot of observed transcript expression levels (y-axis) and true abundance (x-axis) of ERCC spike-ins through 5′-end quantification (n = 92). Each point represents a transcript. The Pearson’s correlation coefficient is shown in the upper right corner. b Scatterplot showing the Pearson’s correlation of isoform transcription levels between replicate pools of three single cells. c Heatmap of Pearson’s correlation coefficients between the transcriptomes of DRG neurons and oocytes, based on 5′-end quantification of RNA isoforms. d Heatmap showing RNA isoforms of alternative TSS choices with cell-type specificity. The major isoforms either in oocytes or in DRG neurons are shown (n = 372 isoforms). e Genome browser tracks showing the alternative choices of TSS of Tse22d1 between oocytes and DRG neurons. f Squared coefficients of variation of scRCAT-seq (n = 4) and ScISOr-seq (n = 4), versus the means of normalized read counts. Shown is the mean of squared coefficients of variation shaded by 95% confidence intervals. Source data are provided as a Source data file.
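The two statistics used in this figure, Pearson's correlation coefficient (panels a to c) and the squared coefficient of variation (panel f), can be computed as in the sketch below. The numbers are invented for illustration and are not ERCC concentrations or read counts from the study; whether the authors correlated raw or log-transformed values is not stated in the caption, so no transformation is applied here.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

def squared_cv(counts):
    """Squared coefficient of variation: sample variance divided by mean squared."""
    n = len(counts)
    mean = sum(counts) / n
    variance = sum((c - mean) ** 2 for c in counts) / (n - 1)
    return variance / mean ** 2

# Made-up "true abundance" and "observed expression" for four transcripts.
true_abundance = [1.0, 2.0, 4.0, 8.0]
observed = [2.1, 3.9, 8.2, 15.8]
r = pearson_r(true_abundance, observed)

# Made-up normalized read counts for one isoform across two cells.
cv2 = squared_cv([5.0, 15.0])
```

A lower squared coefficient of variation at a given mean indicates less cell-to-cell noise in the measurement, which is how plots like panel f are usually read.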
Fig. 4. Isoform dynamics during human cone development.
a Outline of the high-throughput scRCAT-seq. b Trajectory plot showing the distribution of TES and TSS data on the trajectory of cone development. Each dot represents a single cell, either from TSS or TES data. c Trajectory for the development of cones from RPCs, generated by pseudotime analysis with RPC, PR precursor, and photoreceptor cone data. The numbers below show the trajectory divided into stages, to assess the isoform dynamics. d Expression data with isoform specificity reveal differential TSS choices (left) and TES choices (right) between cone and RPC. e Venn diagram of genes with alternative TSSs and with alternative TESs. Significance was computed using two-sided hypergeometric test. f Dynamics of the ratio of major isoforms during the development of cones from RPCs. Examples of isoforms with significant differential choices of TSS/TES between RPC and cone are shown, with dynamics for TSS and TES choices in the upper and lower panels, respectively. The color shows the logNormalized ratio of major isoforms of RPC in each stage. g Genome browser track showing the representative gene CCND1, where two isoforms differ by a switched TES choice over the time course of cone development. Source data are provided as a Source data file.
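Panel e tests whether the overlap between genes with alternative TSSs and genes with alternative TESs differs from what chance would give. The caption does not state which two-sided convention was used; the sketch below assumes a common one, summing the probabilities of every possible overlap that is no more likely than the observed one. All gene counts in the example are made up.

```python
from math import comb

def hypergeom_pmf(k, total, marked, drawn):
    """P(overlap == k) when `drawn` genes are sampled from `total`,
    of which `marked` belong to the other set."""
    return comb(marked, k) * comb(total - marked, drawn - k) / comb(total, drawn)

def hypergeom_two_sided(k, total, marked, drawn):
    """Two-sided p-value: sum of probabilities of all overlaps that are
    no more likely than the observed overlap k (assumed convention)."""
    lowest = max(0, drawn - (total - marked))
    highest = min(marked, drawn)
    p_observed = hypergeom_pmf(k, total, marked, drawn)
    p_value = sum(
        p
        for p in (hypergeom_pmf(i, total, marked, drawn)
                  for i in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)  # tolerance for floating-point ties
    )
    return min(1.0, p_value)

# Made-up example: 20 genes in total, 5 with alternative TSSs,
# 5 with alternative TESs, and all 5 shared between the two sets.
p_complete_overlap = hypergeom_two_sided(5, total=20, marked=5, drawn=5)

# Same sets, but only 1 shared gene (the most likely overlap by chance).
p_typical_overlap = hypergeom_two_sided(1, total=20, marked=5, drawn=5)
```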
