PLoS Comput Biol. 2025 Apr 3;21(4):e1012878.

doi: 10.1371/journal.pcbi.1012878. eCollection 2025.

Predicting and comparing transcription start sites in single cell populations

Shiwei Fu¹, Wei Vivian Li¹

PMID: 40179341
PMCID: PMC11968111
DOI: 10.1371/journal.pcbi.1012878

Shiwei Fu et al. PLoS Comput Biol. 2025.

Affiliation

¹ Department of Statistics, University of California, Riverside, Riverside, California, United States of America.

Abstract

The advent of 5' single-cell RNA sequencing (scRNA-seq) technologies offers unique opportunities to identify and analyze transcription start sites (TSSs) at a single-cell resolution. These technologies have the potential to uncover the complexities of transcription initiation and alternative TSS usage across different cell types and conditions. Despite the emergence of computational methods designed to analyze 5' RNA sequencing data, current methods often lack comparative evaluations in single-cell contexts and are predominantly tailored for paired-end data, neglecting the potential of single-end data. This study introduces scTSS, a computational pipeline developed to bridge this gap by accommodating both paired-end and single-end 5' scRNA-seq data. scTSS enables joint analysis of multiple single-cell samples, starting with TSS cluster prediction and quantification, followed by differential TSS usage analysis. It employs a Binomial generalized linear mixed model to accurately and efficiently detect differential TSS usage. We demonstrate the utility of scTSS through its application in analyzing transcriptional initiation from single-cell data of two distinct diseases. The results illustrate scTSS's ability to discern alternative TSS usage between different cell types or biological conditions and to identify cell subpopulations characterized by unique TSS-level expression profiles.
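
For orientation, a Binomial generalized linear mixed model for differential TSS usage can be written in the generic form below. This is a textbook sketch with a donor-level random intercept, not the specification used by scTSS: the symbols, the logit link, and the choice of random effect are all assumptions, since the abstract does not state them.

```latex
% Generic Binomial GLMM (assumed form, not taken from the paper)
% y_{ij}: reads assigned to a TSS cluster in unit j of donor i
% n_{ij}: total reads of the corresponding gene in that unit
% x_{ij}: indicator of the biological condition (e.g., cell type)
% b_i:    random intercept shared by all units from donor i
y_{ij} \mid b_i \sim \mathrm{Binomial}(n_{ij},\, p_{ij}), \qquad
\operatorname{logit}(p_{ij}) = \beta_0 + \beta_1 x_{ij} + b_i, \qquad
b_i \sim \mathcal{N}(0, \sigma^2)
```

Under this form, testing for differential usage of a TSS cluster amounts to testing the null hypothesis that \beta_1 = 0.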

Copyright: © 2025 Fu and Li. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. An overview of the scTSS toolkit. (A) A schematic comparison between the paired-end data (on-site data; left) and single-end data (near-site data; right). In the on-site data, Read 1 directly captures the genuine TSS. In the near-site data, Read 1 does not directly capture the genuine TSS, but there exists a predictable relationship between Read 2 and the genuine TSS, facilitating its estimation. (B) The workflow of the scTSS toolkit. scTSS consists of two major steps, TSS cluster prediction and quantification, followed by differential TSS usage (DU) analysis. In the prediction and quantification step, the toolkit takes in either predicted TSS clusters based on on-site reads, or predicted near-site clusters based on near-site reads, and eventually outputs cell-level count matrices across samples with unified TSS clusters. In the DU analysis step, we test for changes of TSS usage between biological conditions (e.g., cell types). (C) A toy example of the DU test given TSS counts from two cell types across five donors. We suppose TSS cluster 1 displays differential usage between the two cell types, with higher usage observed in cell type A, whereas the usage of TSS cluster 2 does not present significant difference between cell types. scTSS computes a P-value for each TSS cluster, indicating the significance of its differential usage between the two cell types.
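
The toy setting in panel (C) can be mimicked with a few lines of Python. The sketch below only computes per-donor TSS usage (cluster reads divided by gene reads) and the difference in mean usage between two cell types; it does not fit the Binomial GLMM or produce a P-value. All counts, cluster names, and function names are hypothetical and chosen so that cluster 1 differs between cell types while cluster 2 does not.

```python
from statistics import mean

# Hypothetical read counts, invented for illustration only.
# Each entry is (reads assigned to the TSS cluster, total reads of its gene)
# for one donor; there are five donors per cell type.
toy_counts = {
    "cluster_1": {
        "A": [(80, 100), (75, 100), (90, 100), (70, 100), (85, 100)],
        "B": [(40, 100), (50, 100), (45, 100), (35, 100), (55, 100)],
    },
    "cluster_2": {
        "A": [(30, 100), (25, 100), (35, 100), (28, 100), (32, 100)],
        "B": [(31, 100), (27, 100), (33, 100), (29, 100), (30, 100)],
    },
}


def donor_usages(pairs):
    """Per-donor usage: cluster reads / gene reads, skipping undetected genes."""
    return [cluster / total for cluster, total in pairs if total > 0]


def usage_difference(cluster):
    """Mean usage in cell type A minus mean usage in cell type B."""
    return mean(donor_usages(cluster["A"])) - mean(donor_usages(cluster["B"]))


for name, cluster in toy_counts.items():
    print(name, round(usage_difference(cluster), 3))
```

A formal test would replace the simple difference of means with a model that accounts for donor-to-donor variability, which is the role the mixed model plays in the toolkit.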

Fig 2. TSS prediction results of the COVID-19 dataset. (A) The distribution of genomic distance between predicted dominant TSSs and the corresponding annotated TSSs using all genes with one annotated TSS. A positive distance means the predicted dominant TSS is on the 5’ side of the annotated TSS and vice versa. The analysis was restricted to TSS clusters whose predicted dominant TSSs fell within 500 bp of the annotated TSSs, since a larger distance more likely indicates missing annotations. (B) Accuracy of TSS cluster prediction compared with the complete FANTOM5 annotation. The precision, recall, and F1 scores were first calculated for each sample, and then averaged across samples. The error bars indicate the standard deviation of the scores. (C) The proportion of different types of predicted dominant TSSs based on SCAFE. The classification of TSS locations was based on the GRCh38 reference.
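
The per-sample scoring and across-sample averaging described in panel (B) can be sketched as follows. The matching rule is an assumption: here a predicted cluster counts as correct if it overlaps any annotated cluster by at least one base, which may differ from the criterion used in the paper. The function names and the use of the sample standard deviation are likewise illustrative.

```python
from statistics import mean, stdev


def overlaps(a, b):
    """True if closed intervals a = (start, end) and b = (start, end) share a base."""
    return a[0] <= b[1] and b[0] <= a[1]


def precision_recall_f1(predicted, annotated):
    """Score one sample's predicted TSS clusters against annotated clusters."""
    hit_pred = sum(any(overlaps(p, a) for a in annotated) for p in predicted)
    hit_ann = sum(any(overlaps(a, p) for p in predicted) for a in annotated)
    precision = hit_pred / len(predicted) if predicted else 0.0
    recall = hit_ann / len(annotated) if annotated else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


def summarize(per_sample_scores):
    """Mean and sample SD of each metric across samples (needs >= 2 samples)."""
    return [(mean(col), stdev(col)) for col in zip(*per_sample_scores)]


# Hypothetical coordinates for two samples and one annotation set.
annotated = [(100, 150), (300, 350), (500, 550)]
sample_1 = [(90, 110), (320, 330), (900, 950), (1200, 1250)]
sample_2 = [(100, 120), (340, 360), (510, 520)]

scores = [precision_recall_f1(s, annotated) for s in (sample_1, sample_2)]
for metric, (avg, sd) in zip(("precision", "recall", "F1"), summarize(scores)):
    print(metric, round(avg, 3), round(sd, 3))
```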

Fig 3. TSS prediction results of the Arthritis dataset. (A) The distribution of the genomic distance between predicted TSS cluster centers and the corresponding annotated TSSs using all genes with one annotated TSS. A positive distance means the predicted center is on the 5’ side of the annotated TSS and vice versa. The analysis was restricted to TSS clusters whose centers fell within 500 bp of the annotated TSSs. (B) Accuracy of the TSS cluster prediction compared with the FANTOM5 annotation. The precision, recall, and F1 scores were first calculated for each sample, and then averaged across samples. The error bars indicate the standard deviation of the scores. (C) Number of genes with varying numbers of predicted TSS clusters based on the prediction of TSSr. The bars represent the mean value across samples, and the error bars represent the standard deviation. (D) The proportion of different types of predicted TSS cluster centers based on TSSr. The classification of center locations was based on the GRCh38 reference. When the original near-site clusters were located in an intronic region, indicating missing annotations, we could not confirm the locations of genuine TSS centers. Therefore, we excluded those clusters from the categorization.
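
The sign convention and the 500 bp restriction used in panel (A) of Figs 2 and 3 can be illustrated with a short sketch. It assumes 1-based genomic coordinates, a strand label of "+" or "-", and an inclusive 500 bp window; the function names are hypothetical.

```python
def signed_distance(predicted_pos, annotated_pos, strand):
    """Distance that is positive when the prediction lies 5' of the annotation.

    On the "+" strand the 5' side is the smaller coordinate; on the "-"
    strand it is the larger coordinate.
    """
    if strand == "+":
        return annotated_pos - predicted_pos
    if strand == "-":
        return predicted_pos - annotated_pos
    raise ValueError(f"unknown strand: {strand!r}")


def within_window(pairs, max_distance=500):
    """Keep signed distances whose absolute value is at most max_distance."""
    distances = [signed_distance(p, a, s) for p, a, s in pairs]
    return [d for d in distances if abs(d) <= max_distance]


# Hypothetical (predicted, annotated, strand) triples.
pairs = [(950, 1000, "+"), (1050, 1000, "+"), (1050, 1000, "-"), (3000, 1000, "+")]
print(within_window(pairs))
```

The last triple is dropped because its distance exceeds the window, which mirrors the caption's reasoning that a large distance more likely reflects a missing annotation than a prediction error.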

Fig 4. Simulation study of DU analysis with 500 cells per sample. (A) Comparison of type I error rates between seven TSS DU testing methods given various sample sizes, outlier frequencies (π), and degrees of outlier deviation. (B) Comparison of statistical power between the seven TSS DU testing methods. The results of bulk-level and cell-level Binomial GLMMs were virtually identical, resulting in overlapping lines.

Fig 5. Differential TSS usage analysis between the activated T (ACT) cells and B cells in the COVID-19 dataset. (A) Heatmap of normalized average TSS usage in ACT and B cells. For each TSS cluster, its sample-specific usage was first calculated by taking the average across all cells in that sample. Then, the average usage was normalized across samples using the min-max normalization. (B) tSNE plot of all ACT and B cells among healthy donors. (C) tSNE plots of ACT and B cells colored by the usage of different TSS clusters on *IFITM2*. Cells in which *IFITM2* was not detected were excluded. (D) tSNE plots of ACT and B cells colored by the usage of different TSS clusters on *NCF1*. Cells in which *NCF1* was not detected were excluded.
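
The two-step normalization described for the heatmap in panel (A), averaging usage over cells within each sample and then applying min-max normalization across samples, can be sketched as below. The cell-level usage values and sample names are hypothetical, and mapping a constant row to zeros is an assumption made here to avoid division by zero.

```python
from statistics import mean


def min_max(values):
    """Rescale values to [0, 1]; a constant input maps to all zeros (assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def normalized_sample_usage(cell_usage_by_sample):
    """Average usage over cells per sample, then min-max normalize across samples."""
    samples = list(cell_usage_by_sample)
    averages = [mean(cell_usage_by_sample[s]) for s in samples]
    return dict(zip(samples, min_max(averages)))


# Hypothetical cell-level usage of one TSS cluster in three samples.
cell_usage = {
    "sample_1": [0.25, 0.25],
    "sample_2": [0.5, 0.5, 0.5],
    "sample_3": [0.5, 1.0],
}
print(normalized_sample_usage(cell_usage))
```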

Fig 6. Clustering analysis of naïve T cells based on TSS-specific expression. (A) tSNE plot of naïve T cells based on TSS-specific expression. (B) Heatmap of sample-level TSS usage between clusters 3 and 4. The top 200 differential TSS clusters were selected based on P-values. For each TSS cluster, the min-max normalization was performed on its usage across samples.
