Nat Commun. 2023 Nov 9;14(1):7240.
doi: 10.1038/s41467-023-42636-1.

CamoTSS: analysis of alternative transcription start sites for cellular phenotypes and regulatory patterns from 5' scRNA-seq data

Ruiyan Hou et al. Nat Commun. 2023.

Abstract

Five-prime single-cell RNA-seq (scRNA-seq) has been widely employed to profile cellular transcriptomes; however, its power for analysing transcription start sites (TSSs) has not been fully utilised. Here, we present a computational method suite, CamoTSS, to precisely identify TSSs and quantify their expression by leveraging the cDNA on read 1, which enables effective detection of alternative TSS usage. With various experimental data sets, we have demonstrated that CamoTSS can accurately identify TSSs and that the detected alternative TSS usages show strong specificity in different biological processes, including cell types across human organs, the development of human thymus, and cancer conditions. As evidenced in nasopharyngeal cancer, alternative TSS usage can also reveal regulatory patterns including systematic TSS dysregulations.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Developing CamoTSS to identify transcription start site (TSS) from 5' tag-based scRNA-seq data.
A A flow chart of the 5' scRNA-seq gene expression library construction (10x Genomics). B A schematic of CamoTSS which includes clustering, filtering and annotation. “C” denotes cluster. The lines within the cluster circles represent the aligned reads and their start positions are denoted by red circles. C Classifier embedding in CamoTSS includes a logistic regression model and a convolutional neural network model. Ranked ATAC-seq peaks were used as ground truth labels for the TSS clusters when training classifiers.
Fig. 2. CamoTSS can accurately detect TSS.
A Receiver operating characteristic (ROC) curves for TSS classification with three groups of features by using logistic regressions; the curves are for pooled non-redundant TSSs from iPSC and DMFB datasets (Methods; individual samples shown in Supplementary Fig. S1). Ten-fold cross-validation is used for the evaluation. Source data are provided as a Source Data file. B ROC curves for using the iPSC dataset to predict the DMFB dataset, and for using the pre-trained cluster model (trained on the combined iPSC and DMFB datasets) to predict the PBMC dataset with paired scATAC-seq and scRNA-seq data. Source data are provided as a Source Data file. C ROC curves for using samples which do not contain binding sites of CTCF or E2F6 as training datasets to predict samples which only contain binding sites of CTCF or E2F6. Source data are provided as a Source Data file. D All features (e.g. cluster features and sequence features), all features with one feature dropped, and each single feature alone were fed to the logistic regression model to perform prediction. AUROC values are obtained via 10-fold cross-validation (n = 10). Source data are provided as a Source Data file. E The distributions of RNA POL2, H3K4me3, H3K27ac and H3K36me3 signals around the TSSs detected by us and the random regions produced by bedtools. RNA POL2, H3K4me3 and H3K27ac show enrichment around TSSs while H3K36me3 is enriched downstream of our TSSs. F Track plots of two examples (DPH1 and SCP2) show peaks of scRNA-seq, scATAC-seq, POL2, H3K27ac and H3K4me3. Red lines denote the location of our detected TSSs. G Pie chart of the percentage of our detected TSS regions/clusters annotated by the reference genome or identified as novel TSSs. H Genomic distribution of the detected TSS regions. Source data are provided as a Source Data file.
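The AUROC values reported per cross-validation fold in this caption can be illustrated with a minimal sketch. It assumes that held-out classifier scores and ground-truth labels are already available; the function names `auroc` and `per_fold_auroc` and the toy numbers are hypothetical and are not taken from CamoTSS.

```python
from typing import Dict, Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive scores higher
    than a randomly chosen negative (ties count as one half).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def per_fold_auroc(
    scores: Sequence[float], labels: Sequence[int], fold_ids: Sequence[int]
) -> Dict[int, float]:
    """Compute one AUROC per held-out fold (fold_ids marks each example's fold)."""
    result = {}
    for fold in sorted(set(fold_ids)):
        idx = [i for i, f in enumerate(fold_ids) if f == fold]
        result[fold] = auroc([scores[i] for i in idx], [labels[i] for i in idx])
    return result


# Toy example (hypothetical scores): two folds with two candidate TSS clusters each.
toy = per_fold_auroc([0.9, 0.2, 0.6, 0.7], [1, 0, 1, 0], [0, 0, 1, 1])
```

With ten folds, the ten resulting values would correspond to the n = 10 AUROC values summarised in panel D.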
Fig. 3. CamoTSS analysis on TSSs between cell types across 15 human organs.
A UMAP projection of TSS profile (left) and RNA profile (middle and right) in muscle. The T cell cluster is highlighted in color; all other cells are colored gray. B ROC curves for NK/T cell cluster prediction by AUCell scores. Source data are provided as a Source Data file. C The top de novo motifs enriched in the top 500 cluster-specific peaks of S4 and S9. P-values were calculated by using binomial tests. D Heatmap of Pearson’s correlation of expression of common TSSs among all cells from all organs. E Venn diagrams of the top 20 significant TSS markers and RNA expression markers in 8 cell clusters of the bladder. F UMAP plots of a marker specific to the TSS data. G tSNE plots showing an alternative TSS marker that is masked at the gene level. H ROC curves for prediction of cell types from the first 20 PCs of the TSS matrix by using a random forest in a multi-label classification. Models were evaluated by using 10-fold cross-validation, whereby the overall average is obtained by merging all cell types at a micro level. I Enrichment network representing the top 20 enriched terms of significant alternative TSSs. The enriched terms that displayed high similarity were grouped together and presented as a network diagram. In this diagram, each node corresponds to an enriched term and is assigned a color based on its cluster. The size of each node reflects the number of enriched genes, while the thickness of the lines connecting nodes represents the similarity score between the enriched terms.
Fig. 4. CamoTSS identifies differential alternative TSS usage from nasopharyngeal carcinoma.
A, B UMAP plot of gene-level expression, annotated with cell types (A) and disease status (B). C Volcano plot to show the relationship between ELBO_gain and effect size on logit(PSI) for detecting differential TSS between NPC and NLH patients. Cell_coeff is the effect size on logit(PSI). Positive value means higher PSI in NPC. ELBO_gain denotes the evidence lower bound difference for the two hypotheses (Methods). D Genome track plot of LIMS1 in different cell types of NLH and NPC patients. One horizontal genome track denotes the coverage of all cells in one cell type. E, F Violin plot on example gene LIMS1 for T cell (E; n = 6964 cells for NLH; n = 17,607 cells for NPC) and Myeloids (F; n = 158 cells for NLH; n = 923 cells for NPC) in NLH and NPC patients. The y-axis PSI denotes the proportion of TSS1 (LIMS1-215; minor TSS here) among the top two TSSs in each cell type. G Bar plot showing the enriched terms of genes with differential TSS usage between NLH and NPC patients in the T cell. H WebLogo of the base frequency of MA1929.1 (i.e. one motif of CTCF) enriched in the sequences detected by FIMO (top) and displayed in the JASPAR database (bottom). I Scatter plot of the binding frequency of human TFs on 528 TSS regions elevated in NLH and NPC patients (shown is based on T cells). Source data are provided as a Source Data file. J Box plot of expressed cell proportion of CTCF between NLH (n = 3 patients) and NPC (n = 7 patients). K Heatmap shows the hierarchical clustering of patients by the proportion of expressed cells of TFs that have significant differential binding frequency between NPC and NLH groups (n = 10 patients). The color in the heatmap means the proportion of expressed cells with rescaling to the range of 0 and 1 on row. The up- and down-regulated TFs were displayed in red and blue, respectively (NPC vs NLH). Blue ID **: fold change < 0.6; blue ID *: 0.6 < fold change < 0.8; red ID *: 1.2 < fold change < 1.5; red ID **: fold change > 1.5. Source data are provided as a Source Data file.
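The PSI and logit(PSI) quantities used in these panels can be made concrete with a small sketch. It follows the caption's definition of PSI as the proportion of TSS1 among the top two TSSs. The function names, the toy read counts, and the naive difference of group means are illustrative assumptions; the reported effect size (Cell_coeff) and ELBO_gain come from the model described in the paper's Methods, which this sketch does not reproduce.

```python
import math
from typing import List, Optional, Tuple


def psi(tss1_count: int, tss2_count: int) -> Optional[float]:
    """Proportion of TSS1 reads among the two TSSs; None if there is no coverage."""
    total = tss1_count + tss2_count
    if total == 0:
        return None
    return tss1_count / total


def logit(p: float, eps: float = 1e-6) -> float:
    """Log-odds of p, clipped away from 0 and 1 so the result stays finite."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def mean_logit_psi(cells: List[Tuple[int, int]]) -> float:
    """Average logit(PSI) over cells that have coverage at either TSS."""
    values = [logit(v) for v in (psi(a, b) for a, b in cells) if v is not None]
    if not values:
        raise ValueError("no cell has coverage at either TSS")
    return sum(values) / len(values)


# Hypothetical (TSS1, TSS2) read counts per cell in two groups.
nlh_cells = [(1, 9), (2, 8), (0, 0)]
npc_cells = [(5, 5), (6, 4)]

# Naive stand-in for an effect size on logit(PSI): positive means higher PSI in NPC.
naive_effect = mean_logit_psi(npc_cells) - mean_logit_psi(nlh_cells)
```

Working on the logit scale maps PSI from the interval (0, 1) onto the whole real line, which is why the volcano plots report effect sizes on logit(PSI) rather than on PSI itself.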
Fig. 5. CamoTSS detected transcription start site switch from epithelial cells of gastric cancer.
A UMAP visualization of epithelial cells (n = 8485) in gastric cancer. Each dot represents an individual cell, where colors indicate cell subtype. B Volcano plot displaying the relationship between ELBO_gain and effect size on logit(PSI) for detecting differential TSS between normal and tumor cells. Cell_coeff is the effect size on logit(PSI). A positive value means higher PSI in normal cells. ELBO_gain denotes the evidence lower bound difference for the two hypotheses (Methods). C Boxplot showing an example gene with significantly different TSS usage between normal (n = 5977) and tumor cells (n = 2508). D Genome track plot of SLC29A1 in normal and tumor cells. One horizontal genome track represents the coverage of all cells within one specific cell type. E Bar plot exhibiting the enriched terms of genes with differential TSS usage between normal and tumor epithelial cells in gastric cancer. Source data are provided as a Source Data file. F Scatter plot of the binding frequency of human TFs from the JASPAR database on 453 TSS regions elevated in normal and tumor epithelial cells. Source data are provided as a Source Data file.
Fig. 6. CamoTSS identifies differential alternative TSS usage from human thymus development.
A, B UMAP visualization of all cells (n = 35,629) in thymus development. Each dot is one cell, with colors coded according to the time points (A) and cell types (B). C Volcano plot showing the relationship between ELBO_gain and effect size on logit(PSI) for detecting differential TSS between week11 and week12 (Top), week11 and month30 (Middle), week12 and month30 (Bottom). Cell_coeff is the effect size on logit(PSI). Positive value means higher PSI in week12 (Top), month30 (Middle) and month30 (Bottom), respectively. ELBO_gain denotes the evidence lower bound difference for the two hypotheses (Methods). D Line chart showing four patterns of example genes which are all significant at three development stage pairs. Data are presented as mean values ± SD (n = 7059 cells for week11; n = 12,249 cells for week12; n = 13,854 cells for month30). E Bar plot showing the enriched GO terms of genes with alternative TSS usage between week11 and month30 in the T cell. Source data are provided as a Source Data file. F Illustration of the window sliding algorithm for identifying CTSSs within one TSS cluster. Count and fold change parameters were used to filter noise. G Volcano plot between ELBO_gain and effect size on logit(PSI) for detecting differential CTSS between week11 and month30 in T cell. Same format as panel (C). Source data are provided as a Source Data file. H Violin plot of PSI value of KTN1 and SETD5 among week11 (n = 7059 cells), week12 (n = 12,249 cells) and month30 (n = 13,854 cells). The two farthest CTSSs were picked to calculate PSI for each gene. I Histogram showing the coverage of read 1 with unencoded G at the cap obtained from 5' scRNA-seq in KTN1 (Left) and SETD5 (Right). The gray and red lines represent CTSSs identified by CamoTSS, while the red lines mark the two farthest CTSSs used for differential CTSS analysis with BRIE2.
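Panel F mentions a window sliding algorithm with count and fold change filters but the caption does not spell out its details. The sketch below is only one plausible reading, not CamoTSS's implementation: the function name `call_ctss`, the parameters `window`, `min_count`, `min_fold_change`, and `pseudocount`, and the choice of the neighbouring positions' mean as background are all assumptions made for illustration.

```python
from typing import Dict, List


def call_ctss(
    start_counts: Dict[int, int],
    window: int = 2,
    min_count: int = 10,
    min_fold_change: float = 5.0,
    pseudocount: float = 1.0,
) -> List[int]:
    """Return positions whose read-start count stands out from their neighbourhood.

    start_counts maps a genomic position to the number of reads starting there.
    A position is kept when (a) its count reaches min_count and (b) its count,
    relative to the mean count of the other positions within +/- window, reaches
    min_fold_change. The pseudocount avoids division by zero in empty windows.
    """
    called = []
    for pos in sorted(start_counts):
        count = start_counts[pos]
        if count < min_count:
            continue  # count filter
        neighbours = [
            start_counts.get(pos + offset, 0)
            for offset in range(-window, window + 1)
            if offset != 0
        ]
        background = sum(neighbours) / len(neighbours)
        fold_change = (count + pseudocount) / (background + pseudocount)
        if fold_change >= min_fold_change:  # fold change filter
            called.append(pos)
    return called


# Hypothetical read-start counts within one TSS cluster.
toy_counts = {100: 2, 101: 50, 102: 3, 103: 1, 120: 30, 121: 2}
toy_ctss = call_ctss(toy_counts, window=2, min_count=10, min_fold_change=5.0)
```

In this toy cluster, positions 101 and 120 pass both filters, while a flat plateau of equally covered positions would be rejected by the fold change filter.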
