PLoS One. 2011;6(8):e23409.

doi: 10.1371/journal.pone.0023409. Epub 2011 Aug 24.

Systematic clustering of transcription start site landscapes

Xiaobei Zhao¹, Eivind Valen, Brian J Parker, Albin Sandelin

PMID: 21887249
PMCID: PMC3160847
DOI: 10.1371/journal.pone.0023409

Xiaobei Zhao et al. PLoS One. 2011.

Affiliation

¹ Department of Biology and Biotech Research and Innovation Centre, The Bioinformatics Centre, Copenhagen University, Copenhagen, Denmark.


Abstract

Genome-wide, high-throughput methods for transcription start site (TSS) detection have shown that most promoters have an array of neighboring TSSs, some of which are used more than others, forming a distribution of initiation propensities. TSS distributions (TSSDs) vary widely between promoters, and earlier studies have shown that TSSDs have biological implications for both regulation and function. However, no systematic study has explored how many types of TSSDs, and by extension core promoters, exist, or which biological features distinguish them. In this study, we developed a new non-parametric dissimilarity measure and clustering approach to explore the similarities and stabilities of clusters of TSSDs. Previous studies have used arbitrary thresholds to arrive at two general classes: broad and sharp. We demonstrate that, in addition to the previous broad/sharp dichotomy, an additional category of promoters exists. Unlike typical TATA-driven sharp TSSDs, where the TSS position can vary by a few nucleotides, in this category virtually all TSSs originate from the same genomic position. These promoters lack the epigenetic signatures of typical mRNA promoters, and a substantial subset of them map upstream of ribosomal protein pseudogenes. We present evidence that these are likely mapping errors, which have confounded earlier analyses, caused by the high similarity of ribosomal gene promoters in combination with the known G-addition bias in CAGE libraries. Thus, previous two-class separations of promoters based on TSS distributions are motivated, but the ultra-sharp TSS distributions will confound downstream analyses if not removed.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Figure 1. Clustering of transcription start site distributions (TSSDs).**
(A) Example of the heterogeneity of TSSDs in clustering based on dissimilarity alone. We clustered 100 randomly sampled TSSDs using hierarchical clustering, shown as a dendrogram on the left. The heatmap represents the different partitions that can be produced by placing a cut-line vertically in the dendrogram at various places: the second column shows the two-cluster partition (k = 2), the next the three-cluster partition (k = 3), etc. The color intensity indicates the mean dissimilarity between all the TSSDs within one cluster (darker means higher homogeneity). Note that most clusters are inhomogeneous when k is low: clusters with high homogeneity only emerge when moving the cut-line closer to the leaves. (B) Correlation between the mean cluster peakedness and the cluster stability. The scatter plot compares the cluster stability scores resulting from bootstrap resampling to the intra-cluster peakedness scores. R denotes the Pearson correlation coefficient of the scores. (C) Distribution of intra-cluster peakedness scores of 500 TSSD clusters generated by hierarchical clustering. The Y-axis shows the intra-cluster peakedness scores. The red lines indicate offsets for defining three larger clusters using k-means. Each box represents one TSSD cluster.
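
The idea behind panel (A), cutting a hierarchical clustering into k clusters and scoring each cluster by its mean internal dissimilarity, can be illustrated with a small sketch. This is not the authors' implementation: their non-parametric dissimilarity measure is not described on this page, so total variation distance is used here purely as a stand-in, average linkage is an assumed choice, and the toy TSSDs are invented for illustration.

```python
from itertools import combinations

def total_variation(p, q):
    """Stand-in dissimilarity (assumption): half the L1 distance between two distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def average_linkage(c1, c2, dist):
    """Mean pairwise dissimilarity between the members of two clusters."""
    return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

def partition(dist, k):
    """Merge the closest pair of clusters until k remain (same as cutting the dendrogram at k)."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        def linkage(pair):
            return average_linkage(clusters[pair[0]], clusters[pair[1]], dist)
        a, b = min(combinations(range(len(clusters)), 2), key=linkage)
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b keeps a valid
        del clusters[b]
    return clusters

def mean_within(cluster, dist):
    """Mean pairwise dissimilarity inside one cluster (0.0 for a singleton)."""
    pairs = list(combinations(cluster, 2))
    return sum(dist[i][j] for i, j in pairs) / len(pairs) if pairs else 0.0

# Invented toy TSSDs: fraction of tags at each of four positions.
tssds = [
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.25, 0.25, 0.25, 0.25],
    [0.3, 0.2, 0.3, 0.2],
    [0.0, 0.0, 0.0, 1.0],
]
dist = [[total_variation(p, q) for q in tssds] for p in tssds]

for k in (2, 3):
    for cluster in partition(dist, k):
        print(k, sorted(cluster), round(mean_within(cluster, dist), 3))
```

In this toy example the k = 2 partition lumps sharp and broad profiles into one heterogeneous cluster, while k = 3 separates them, which mirrors the caption's remark that homogeneous clusters only appear closer to the leaves.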

**Figure 2. Properties of “scattered”, “dense” and “ultra-dense” TSSD clusters.**
(A) Examples of individual TSSDs of each class: scattered TSSDs (left column), dense TSSDs (middle column) and ultra-dense TSSDs (right column). The X-axis shows the relative genomic position, with the 5′ end of the distribution placed at coordinate 1. The Y-axis shows the fraction of the tags. The text above each distribution gives the gene names or transcriptional unit identifiers of the TSSD in the FANTOM3 database, with the TSSD identifier in brackets. The inset gives the width of the TSSD. (B) Distribution of the TSSD widths. The width distribution characterizes how dense the TSSDs are. Scattered, dense and ultra-dense TSSDs are in the top, middle and bottom panels, respectively. The X-axis shows the width of the TSSDs in nt. Scattered TSSDs are mainly in the range from 20 nt to 200 nt; dense TSSDs are generally less than 20 nt long; ultra-dense TSSDs are in most cases 1 nt wide. (C) Box-plots showing the distribution of peakedness scores of individual TSSDs. Scattered, dense and ultra-dense TSSDs are in the left, middle and right boxes, respectively.
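
To make the axes in panels (A) and (B) concrete, the following minimal sketch computes the width of one TSSD and the fraction of tags at each relative position, with the 5′ end at coordinate 1. The input format (a mapping from genomic position to CAGE tag count) and the function name `tssd_profile` are hypothetical, and the sketch assumes a plus-strand TSSD.

```python
def tssd_profile(tag_counts):
    """Return (width_nt, {relative_position: fraction_of_tags}) for one TSSD.

    tag_counts maps genomic position -> number of CAGE tag 5' ends
    (hypothetical input format; plus strand assumed).
    """
    positions = [p for p, n in tag_counts.items() if n > 0]
    if not positions:
        raise ValueError("TSSD has no tags")
    start, end = min(positions), max(positions)
    total = sum(tag_counts[p] for p in positions)
    width = end - start + 1  # a single-position TSSD is 1 nt wide
    fractions = {p - start + 1: tag_counts[p] / total for p in sorted(positions)}
    return width, fractions

# Invented example: three positions carrying 2, 6 and 2 tags.
width, fractions = tssd_profile({1000: 2, 1003: 6, 1004: 2})
print(width)      # 5
print(fractions)  # {1: 0.2, 4: 0.6, 5: 0.2}
```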

**Figure 3. Sequence and expression features of “scattered”, “dense” and “ultra-dense” TSSD classes.**
For each class, the TSSDs are aligned at their dominant peaks (labeled “TSS” on the X-axis). (A) Sequence properties of promoters divided by TSSD class. Sequence logos of the DNA sequence of the TSSDs aligned at the dominant TSS. The X-axis shows the relative genomic position, where +1 indicates the TSS. The Y-axis shows the information content measured in bits. (B) TATA-box density of promoters divided by TSSD class. The count of predicted TATA sites flanking the dominant TSS (±100 nt) of the TSSDs. The X-axis shows the position of the first T of the TATA site relative to the dominant TSS in the ±100 nt region; the Y-axis shows the number of predicted sites per TSSD. Note that the absolute frequencies of predicted sites are strongly dependent on the cutoffs, but the relative differences between different TSSDs are not cutoff-dependent. TATA sites are strongly over-represented at around −32 nt in the dense group (middle panel) but are less defined in the scattered group (top panel). The ultra-dense group (bottom panel) shows a small TATA signal located at either −32 nt or around −20 nt. (C) CpG island coverage of promoters divided by TSSD class. The coverage of CpG islands is illustrated in the flanking region (±1000 nt) around the TSSs. The X-axis shows the genomic position relative to the TSSs; the Y-axis shows the number of nucleotides covered by CpG islands per TSSD. (D) Tissue specificity of TSSD classes. The box-plots show the distribution of the overall tissue specificity, given the class of the TSSD, calculated as the Kullback-Leibler divergence. The smaller the divergence, the lower the tissue specificity. (E) Sequence conservation. Sequence conservation is represented as mean PhastCons scores over all sites in the ±1000 nt flanking region around the TSSs. PhastCons scores vary from 0 to 1, with 1 indicating high conservation. The X-axis shows the genomic position relative to the TSSs; the Y-axis shows the mean PhastCons scores. (F) Occurrence of repetitive elements in promoters divided by TSSD class. The X-axis shows the genomic position relative to the dominant peak of the TSSD; the Y-axis shows the number of nucleotides covered by the respective repetitive elements, normalized by the number of TSSDs. The transposable elements LINE (top), LTR (middle) and SINE (bottom) are overrepresented in the ultra-dense core promoters (in blue) around the dominant TSS.
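
The tissue-specificity score in panel (D) can be sketched as a Kullback-Leibler divergence between a TSSD's tag distribution over tissues and a reference distribution. The caption does not state which reference the authors used, so the uniform default below is an assumption, as are the function names and the toy tag counts.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in bits; terms with p_i == 0 contribute 0.

    Requires q_i > 0 wherever p_i > 0.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def tissue_specificity(tags_per_tissue, background=None):
    """Divergence of one TSSD's tissue distribution from a background distribution."""
    total = sum(tags_per_tissue)
    p = [n / total for n in tags_per_tissue]
    if background is None:
        # Assumption: uniform usage across tissues as the reference.
        background = [1 / len(p)] * len(p)
    return kl_divergence(p, background)

print(tissue_specificity([25, 25, 25, 25]))  # 0.0 (used equally everywhere)
print(tissue_specificity([100, 0, 0, 0]))    # 2.0 (restricted to one of four tissues)
```

With four tissues the score ranges from 0 bits (no specificity) to log2(4) = 2 bits (all tags in one tissue), consistent with the caption's reading that smaller values mean lower tissue specificity.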

**Figure 4. Epigenetic features of “scattered”, “dense” and “ultra-dense” TSSD subclasses.**
The genomic positions relative to the dominant TSS of each TSSD are labeled on the X-axis. The signal strength from the respective epigenetic mark or feature is shown on the Y-axis, counted as ChIP tags per TSSD (or equivalent for non-ChIP approaches). The profiles are (A) RNA polymerase II; (B) nucleosome positioning (center of nucleosome); (C) DNA methylation; (D) histone variant H2A.Z; (E)–(L) histone modifications. The RNA Pol II binding profile is from mouse ES cells, while the epigenetic marks are from human CD4+ T cells and mapped to the mouse genome. See main text for discussion and Figure S5 for additional data.

**Figure 5. Sequence and expression features of subclasses within the scattered TSSDs.**
The plots show the scattered TSSDs (shown in Figures 2–4) divided by how many peaks they have. For each subclass, the TSSDs are aligned at their identified peak(s), denoted by green arrow(s), with the distance between two adjacent peaks rescaled to the same width so that they are comparable. The X-axis shows the genomic position relative to the peaks (TSS). The Y-axis shows the normalized signal per TSSD as in Figure 3. (A) Density of pyrimidine-purine (PyPu) dinucleotides, extended 50 nt upstream (5′) of the first peak and 50 nt downstream (3′) of the last peak. Note that the PyPu dinucleotide enrichment is always positioned at −1/+1 nt of the peak(s), regardless of the number of peaks within the TSSD. (B) Density plot of nucleosome positioning, extended 100 nt upstream of the first (most 5′) peak and 300 nt downstream of the most 3′ peak. The nucleosome binding profile, from human CD4+ T cells, is plotted as in Figure 4B. As in panel (A), the distances between the TSSD peaks are rescaled to be the same in all TSSDs. In addition, d denotes the distance between the position of the highest nucleosome signal and the first peak, and s denotes the binding intensity score. Interestingly, the nucleosomal signal, which is at ∼+110 as expected in the single-peak TSSD, is gradually shifted 20–30 nt downstream. In general, with more peaks the total nucleosomal positioning signal appears less distinct.
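
The PyPu density in panel (A) can be approximated with a short sketch that, for sequences aligned at a peak, reports the fraction carrying a pyrimidine immediately followed by a purine at each offset. The function name, the 0-based `tss_index` convention and the toy sequences are assumptions made for illustration; the rescaling between multiple peaks described in the caption is not reproduced.

```python
PYRIMIDINES = set("CT")
PURINES = set("AG")

def pypu_density(sequences, tss_index, flank):
    """Fraction of sequences with a PyPu dinucleotide at each offset from the peak.

    tss_index is the 0-based index of the +1 base (hypothetical convention).
    Offset 0 means a pyrimidine at -1 and a purine at +1.
    """
    density = {}
    for offset in range(-flank, flank + 1):
        i = tss_index + offset  # index of the candidate purine
        if i - 1 < 0 or any(i >= len(s) for s in sequences):
            raise ValueError("sequences do not cover the requested window")
        hits = sum(
            1 for s in sequences
            if s[i - 1].upper() in PYRIMIDINES and s[i].upper() in PURINES
        )
        density[offset] = hits / len(sequences)
    return density

# Invented sequences, each with the +1 base at index 3.
seqs = ["GGCAGT", "TTTGCC", "AACATT"]
print(pypu_density(seqs, tss_index=3, flank=1))  # {-1: 0.0, 0: 1.0, 1: 0.0}
```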

**Figure 6. Ultra-dense TSSDs associated with ribosomal protein pseudogenes and their transcribed counterparts.**
(A)–(B) Examples of TSSDs mapping to processed pseudogenes and the corresponding transcribed ribosomal protein gene promoters. Each example shows an alignment of the pseudogene (top) and transcribed gene (bottom), with the sequence alignment in the middle and a genome-browser view as the inset. In the browser view, the CAGE distribution (TSSD), mouse RefSeq and RefSeq from other species are shown as separate tracks. Note that the pseudogene has an ultra-dense TSS distribution just at the inferred 5′ end of the pseudogene. In the alignment, the tag distributions (red for pseudogene; blue for transcribed gene) are aligned and shown with the sequence comparison along the X-axis in the middle. The Y-axis shows the number of CAGE tags mapping to the region, counting only the 5′ end. Note that the regions upstream of the TSS are generally dissimilar, while the +2∼+20 nt region from the TSS of the pseudogene TSSD is almost identical (covered by grey boxes). The CT-tract is colored in blue. The position of the single CAGE peak in the pseudogene coincides with the 1 nt difference just upstream of the CT-tract, where the pseudogene has a G (colored in red). (C) Correlation between pseudogene and transcribed gene CAGE tags in terms of distribution over tissues. The Y-axis shows the fraction of tags from each tissue as a stacked barplot for each TSSD. Each panel shows a pair of TSSDs from the transcribed gene and the corresponding pseudogene. The transcribed mouse Rpl41 gene has two corresponding pseudogenes, each with its own ultra-dense TSSD, and therefore has three columns instead of two. Spearman correlation coefficients comparing the tissue distributions of pseudogene and transcribed gene CAGE tags are shown above each panel. All of the correlations are statistically significant: P<0.01 in all cases (data not shown).
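
The comparison in panel (C) rests on a Spearman rank correlation between two tissue distributions. A minimal dependency-free sketch is given below; the tissue fractions are invented, and the significance test mentioned in the caption is not reproduced.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed values."""
    return pearson(average_ranks(x), average_ranks(y))

# Invented fractions of tags per tissue for a gene and its pseudogene.
gene = [0.50, 0.30, 0.15, 0.05]
pseudogene = [0.45, 0.35, 0.12, 0.08]
print(round(spearman(gene, pseudogene), 6))
```

A coefficient near 1 means the two TSSDs rank tissues in the same order, which is the pattern expected if the pseudogene tags are mis-mapped tags from the transcribed gene.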

Cited by

Identifying transcript 5' capped ends in Plasmodium falciparum.
Shaw PJ, Piriyapongsa J, Kaewprommal P, Wongsombat C, Chaosrikul C, Teeravajanadet K, Boonbangyang M, Uthaipibull C, Kamchonwongpaisan S, Tongsima S. PeerJ. 2021 Aug 25;9:e11983. doi: 10.7717/peerj.11983. PMID: 34527439.
DNMT and HDAC inhibitors induce cryptic transcription start sites encoded in long terminal repeats.
Brocks D, Schmidt CR, Daskalakis M, Jang HS, Shah NM, Li D, Li J, Zhang B, Hou Y, Laudato S, Lipka DB, Schott J, Bierhoff H, Assenov Y, Helf M, Ressnerova A, Islam MS, Lindroth AM, Haas S, Essers M, Imbusch CD, Brors B, Oehme I, Witt O, Lübbert M, Mallm JP, Rippe K, Will R, Weichenhan D, Stoecklin G, Gerhäuser C, Oakes CC, Wang T, Plass C. Nat Genet. 2017 Jul;49(7):1052-1060. doi: 10.1038/ng.3889. PMID: 28604729.
VprBP/DCAF1 triggers melanomagenic gene silencing through histone H2A phosphorylation.
Shin Y, Kim S, Liang G, Ulmer TS, An W. Res Sq [Preprint]. 2023 Jul 12:rs.3.rs-2950076. doi: 10.21203/rs.3.rs-2950076/v2. Update in: Biomedicines. 2023 Sep 17;11(9):2552. doi: 10.3390/biomedicines11092552. PMID: 37293029.
Trends in disease burden of hepatitis B infection in Jiangsu Province, China, 1990-2021.
Fang K, Shi Y, Zhao Z, Zhao Y, Guo Y, Abudunaibi B, Qu H, Liu Q, Kang G, Wang Z, Hu J, Chen T. Infect Dis Model. 2023 Jul 10;8(3):832-841. doi: 10.1016/j.idm.2023.07.007. PMID: 37520113.
MMP-9-dependent proteolysis of the histone H3 N-terminal tail: a critical epigenetic step in driving oncogenic transcription and colon tumorigenesis.
Shin Y, Kim S, Liang G, An W. Mol Oncol. 2024 Aug;18(8):2001-2019. doi: 10.1002/1878-0261.13652. PMID: 38600695.
