PLoS Comput Biol. 2025 Apr 3;21(4):e1012878.
doi: 10.1371/journal.pcbi.1012878. eCollection 2025.

Predicting and comparing transcription start sites in single cell populations

Shiwei Fu et al. PLoS Comput Biol.

Abstract

The advent of 5' single-cell RNA sequencing (scRNA-seq) technologies offers unique opportunities to identify and analyze transcription start sites (TSSs) at a single-cell resolution. These technologies have the potential to uncover the complexities of transcription initiation and alternative TSS usage across different cell types and conditions. Despite the emergence of computational methods designed to analyze 5' RNA sequencing data, current methods often lack comparative evaluations in single-cell contexts and are predominantly tailored for paired-end data, neglecting the potential of single-end data. This study introduces scTSS, a computational pipeline developed to bridge this gap by accommodating both paired-end and single-end 5' scRNA-seq data. scTSS enables joint analysis of multiple single-cell samples, starting with TSS cluster prediction and quantification, followed by differential TSS usage analysis. It employs a Binomial generalized linear mixed model to accurately and efficiently detect differential TSS usage. We demonstrate the utility of scTSS through its application in analyzing transcriptional initiation from single-cell data of two distinct diseases. The results illustrate scTSS's ability to discern alternative TSS usage between different cell types or biological conditions and to identify cell subpopulations characterized by unique TSS-level expression profiles.
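For orientation, one generic way to write a Binomial generalized linear mixed model for TSS usage is sketched here. This is an illustrative formulation under our own assumptions, not necessarily the exact specification used by scTSS. All symbols are hypothetical: y_{jk} is the count of a given TSS cluster in donor j under condition k, n_{jk} is the total count of the corresponding gene, x_k is a 0/1 condition indicator, and u_j is a donor-level random intercept.

```latex
y_{jk} \mid p_{jk} \sim \mathrm{Binomial}(n_{jk},\, p_{jk}), \qquad
\operatorname{logit}(p_{jk}) = \beta_0 + \beta_1 x_k + u_j, \qquad
u_j \sim \mathcal{N}(0,\, \sigma^2)
```

Under this sketch, differential TSS usage between the two conditions corresponds to testing the null hypothesis \beta_1 = 0.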

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. An overview of the scTSS toolkit. (A) A schematic comparison between the paired-end data (on-site data; left) and single-end data (near-site data; right). In the on-site data, Read 1 directly captures the genuine TSS. In the near-site data, Read 1 does not directly capture the genuine TSS, but there exists a predictable relationship between Read 2 and the genuine TSS, facilitating its estimation. (B) The workflow of the scTSS toolkit. scTSS consists of two major steps, TSS cluster prediction and quantification, followed by differential TSS usage (DU) analysis. In the prediction and quantification step, the toolkit takes in either predicted TSS clusters based on on-site reads, or predicted near-site clusters based on near-site reads, and eventually outputs cell-level count matrices across samples with unified TSS clusters. In the DU analysis step, we test for changes of TSS usage between biological conditions (e.g., cell types). (C) A toy example of the DU test given TSS counts from two cell types across five donors. We suppose TSS cluster 1 displays differential usage between the two cell types, with higher usage observed in cell type A, whereas the usage of TSS cluster 2 does not present significant difference between cell types. scTSS computes a P-value for each TSS cluster, indicating the significance of its differential usage between the two cell types.
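The toy setting in panel C can be illustrated with a minimal sketch that computes per-donor TSS usage, taken here to mean the fraction of a gene's reads assigned to each TSS cluster. That definition, the function name `usage_per_donor`, and all counts below are assumptions made for illustration. They are not data from the paper, and the sketch does not compute P-values.

```python
# Hypothetical toy counts (not from the paper): reads per TSS cluster of one
# gene, for five donors in each of two cell types.
toy_counts = {
    "A": {
        "TSS1": [60, 62, 58, 61, 59],
        "TSS2": [20, 19, 21, 20, 20],
        "TSS3": [20, 19, 21, 19, 21],
    },
    "B": {
        "TSS1": [30, 32, 28, 31, 29],
        "TSS2": [20, 19, 21, 20, 20],
        "TSS3": [50, 49, 51, 49, 51],
    },
}


def usage_per_donor(cluster_counts):
    """Return, per cluster, the fraction of the gene's reads in each donor.

    Assumes usage = cluster count / total count over all clusters of the gene.
    """
    names = list(cluster_counts)
    n_donors = len(cluster_counts[names[0]])
    usage = {name: [] for name in names}
    for d in range(n_donors):
        total = sum(cluster_counts[name][d] for name in names)
        for name in names:
            # Donors with no reads for the gene get NaN instead of a usage value.
            value = cluster_counts[name][d] / total if total > 0 else float("nan")
            usage[name].append(value)
    return usage


usage = {cell_type: usage_per_donor(c) for cell_type, c in toy_counts.items()}
mean_usage = {
    cell_type: {name: sum(vals) / len(vals) for name, vals in per_cluster.items()}
    for cell_type, per_cluster in usage.items()
}

for cell_type, per_cluster in mean_usage.items():
    for name, value in per_cluster.items():
        print(f"cell type {cell_type}, {name}: mean usage {value:.2f}")
```

In these made-up counts, TSS1 is used more heavily in cell type A while TSS2 has the same usage in both cell types, which mirrors the qualitative pattern described in the caption. A formal DU test would additionally have to account for donor-to-donor variability.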
Fig 2. TSS prediction results of the COVID-19 dataset. (A) The distribution of genomic distance between predicted dominant TSSs and the corresponding annotated TSSs using all genes with one annotated TSS. A positive distance means the predicted dominant TSS is on the 5’ side of the annotated TSS and vice versa. The analysis was restricted to TSS clusters whose predicted dominant TSSs fell within 500 bp of the annotated TSSs, since a larger distance more likely indicates missing annotations. (B) Accuracy of TSS cluster prediction compared with the complete FANTOM5 annotation. The precision, recall, and F1 scores were first calculated for each sample, and then averaged across samples. The error bars indicate the standard deviation of the scores. (C) The proportion of different types of predicted dominant TSSs based on SCAFE. The classification of TSS locations was based on the GRCh38 reference.
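The per-sample scoring described in panel B can be sketched as follows. The sketch assumes that predicted clusters have already been matched against the annotation, so that each sample is summarized by hypothetical true-positive, false-positive, and false-negative counts. It also assumes the sample standard deviation, because the caption does not state which variant was used. The function names and the example counts are illustrative only.

```python
from statistics import mean, stdev


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from matched-cluster counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def summarize_across_samples(per_sample_counts):
    """Score each sample first, then average each score across samples.

    Returns {metric: (mean, standard deviation)}; the standard deviation is
    what the error bars would show under the assumptions stated above.
    """
    scores = [precision_recall_f1(tp, fp, fn) for tp, fp, fn in per_sample_counts]
    summary = {}
    for name, values in zip(("precision", "recall", "f1"), zip(*scores)):
        spread = stdev(values) if len(values) > 1 else 0.0
        summary[name] = (mean(values), spread)
    return summary


# Hypothetical (tp, fp, fn) counts for three samples; not data from the paper.
example_counts = [(80, 20, 20), (60, 40, 40), (70, 30, 30)]
summary = summarize_across_samples(example_counts)
for metric, (avg, sd) in summary.items():
    print(f"{metric}: mean={avg:.3f}, sd={sd:.3f}")
```

Scoring each sample before averaging weights every sample equally regardless of how many clusters it contains, which differs from pooling all counts first.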
Fig 3. TSS prediction results of the Arthritis dataset. (A) The distribution of the genomic distance between predicted TSS cluster centers and the corresponding annotated TSSs using all genes with one annotated TSS. A positive distance means the predicted center is on the 5’ side of the annotated TSS and vice versa. The analysis was restricted to TSS clusters whose centers fell within 500 bp of the annotated TSSs. (B) Accuracy of the TSS cluster prediction compared with the FANTOM5 annotation. The precision, recall, and F1 scores were first calculated for each sample, and then averaged across samples. The error bars indicate the standard deviation of the scores. (C) Gene numbers with varying number of predicted TSS clusters based on the prediction of TSSr. The bars represent the mean value across samples, and the error bars represent the standard deviation. (D) The proportion of different types of predicted TSS cluster centers based on TSSr. The classification of center locations was based on the GRCh38 reference. When the original near-site clusters were located in an intronic region, indicating missing annotations, we could not confirm the locations of genuine TSS centers. Therefore, we excluded those clusters from the categorization.
Fig 4. Simulation study of DU analysis with 500 cells per sample. (A) Comparison of type I error rates between seven TSS DU testing methods given various sample sizes, outlier frequency (π), and degree of outlier deviation. (B) Comparison of statistical power between the seven TSS DU testing methods. The results of bulk-level and cell-level Binomial GLMMs were virtually identical, resulting in overlapping lines.
Fig 5. Differential TSS usage analysis between the activated T (ACT) cells and B cells in the COVID-19 dataset. (A) Heatmap of normalized average TSS usage in ACT and B cells. For each TSS cluster, its sample-specific usage was first calculated by taking the average across all cells in that sample. Then, the average usage was normalized across samples using the min-max normalization. (B) tSNE plot of all ACT and B cells among healthy donors. (C) tSNE plots of ACT and B cells colored by the usage of different TSS clusters on IFITM2. Cells in which IFITM2 was not detected were excluded. (D) tSNE plots of ACT and B cells colored by the usage of different TSS clusters on NCF1. Cells in which NCF1 was not detected were excluded.
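The two-step heatmap preprocessing described in panel A (average usage per sample, then min-max normalization across samples) can be sketched as below. The function names are hypothetical. Two further choices are assumptions not stated in the caption: cells where the gene is undetected are represented as `None` and skipped, and a TSS cluster with identical usage in every sample maps to 0.0.

```python
def sample_level_usage(cell_usages):
    """Average the usage of one TSS cluster over the cells of one sample.

    Cells where the gene was not detected are assumed to be encoded as None
    and are left out of the average.
    """
    observed = [u for u in cell_usages if u is not None]
    return sum(observed) / len(observed) if observed else None


def min_max_normalize(values):
    """Rescale values across samples to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant usage across samples: no spread to rescale, return zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical per-cell usage values for one TSS cluster in three samples.
cells_by_sample = [
    [0.1, 0.3, None, 0.2],
    [0.5, 0.5, 0.5],
    [0.9, 0.7, None],
]
per_sample = [sample_level_usage(cells) for cells in cells_by_sample]
normalized = min_max_normalize(per_sample)
print(per_sample)
print(normalized)
```

Because each TSS cluster is rescaled on its own, the heatmap colors show relative usage across samples within a cluster and are not comparable between clusters in absolute terms.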
Fig 6. Clustering analysis of naïve T cells based on TSS-specific expression. (A) tSNE plot of naïve T cells based on TSS-specific expression. (B) Heatmap of sample-level TSS usage between clusters 3 and 4. The top 200 differential TSS clusters were selected based on P-values. For each TSS cluster, the min-max normalization was performed on its usage across samples.
