Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2009;10(7):R79.
doi: 10.1186/gb-2009-10-7-r79. Epub 2009 Jul 22.

Methods for analyzing deep sequencing expression data: constructing the human and mouse promoterome with deepCAGE data

Affiliations

Methods for analyzing deep sequencing expression data: constructing the human and mouse promoterome with deepCAGE data

Piotr J Balwierz et al. Genome Biol. 2009.

Abstract

With the advent of ultra high-throughput sequencing technologies, increasingly researchers are turning to deep sequencing for gene expression studies. Here we present a set of rigorous methods for normalization, quantification of noise, and co-expression analysis of deep sequencing data. Using these methods on 122 cap analysis of gene expression (CAGE) samples of transcription start sites, we construct genome-wide 'promoteromes' in human and mouse consisting of a three-tiered hierarchy of transcription start sites, transcription start clusters, and transcription start regions.

PubMed Disclaimer

Figures

Figure 1
Figure 1
Reverse cumulative distributions for the number of different TSS positions that have at least a given number of tags mapping to them. Both axes are shown on a logarithmic scale. The three red curves correspond to the distributions of the three THP-1 cell control samples and the three blue curves to the three THP-1 samples after 24 hours of phorbol myristate acetate treatment. All other samples show very similar distributions (data not shown).
Figure 2
Figure 2
Fitted off-sets n0 (horizontal axis) and fitted exponents α (vertical axis) for the 56 human CAGE samples that have at least 100,000 tags.
Figure 3
Figure 3
Normalized reverse cumulative distributions for the number of different TSS positions that have at least a given number of tags mapping to them. Both axes are shown on a logarithmic scale. The three red curves correspond to the distributions of the three THP-1 control samples and the three blue curves to the three THP-1 samples after 24 hours of PMA treatment.
Figure 4
Figure 4
CAGE replicate from THP-1 cells after 8 hours of lipopolysaccharide treatment. For each position with mapped tags, the logarithm of the number of tags per million (TPM) in the first replicate is shown on the horizontal axis, and the logarithm of the number of TPM in the second replicate on the vertical axis. Logarithms are natural logarithms.
Figure 5
Figure 5
Observed histograms of z-statistics for the three 0/1 hour (in red, dark blue, and light blue) samples and for the technical replicate (in yellow) compared with the standard unit Gaussian (in black). The vertical axis is shown on a logarithmic scale.
Figure 6
Figure 6
Average log-ratio L (Equation 13) for neighboring pairs of individual TSSs as a function of the distance between the TSSs. The horizontal axis is shown on a logarithmic scale.
Figure 7
Figure 7
Average PhastCons (conservation) score relative to TSSs of genomic regions upstream and downstream of all human TSCs. The vertical lines show positions -300 and +100 with respect to TSSs.
Figure 8
Figure 8
The number of TSCs as a function of their position relative to the nearest known mRNA start. Negative numbers mean the nearest known mRNA start is upstream of the TSC. The vertical axis is shown on a logarithmic scale. The figure shows only the 46,293 TSCs (62.3%) that have a known mRNA start within 1,000 bp.
Figure 9
Figure 9
Average PhastCons (conservation) score relative to TSSs of genomic regions upstream and downstream of 'novel' human TSCs that are more than 5 kb away from the start of any known transcript.
Figure 10
Figure 10
Hierarchical structure of the human promoterome. (a) Distribution of the number of TSSs per co-expressed TSC. (b) Distribution of the number of TSCs per TSR. (c) Distribution of the number of TSSs per TSR. The vertical axis is shown on a logarithmic scale in all panels. The horizontal axis is shown on a logarithmic scale in (a, c).
Figure 11
Figure 11
Length (base pairs along the genome) distribution of (a) TSCs and (b) TSRs. Both axes are shown on logarithmic scales in both panels.
Figure 12
Figure 12
Nearby TSCs with significantly differing expression profiles. (a) A 90-bp region on chromosome 3 containing 5 TSCs (colored segments) and the start of the annotated locus of the SENP5 gene (black segment). (b) Positions of the individual TSSs in the TSC and their total expression, colored according to the TSC to which each TSS belongs. (c) Expression across the 56 CAGE samples for the red and blue colored TSCs.
Figure 13
Figure 13
Reverse cumulative distributions of the standard deviation in expression across the 56 CAGE samples for the TSCs obtained with our clustering procedure (red) and the FANTOM3 single-linkage clustering procedure (green).
Figure 14
Figure 14
Two-dimensional histogram (shown as a heatmap) of the CG base content (horizontal axis) and CpG dinucleotide content (vertical axis) of all human TSRs. Both axes are shown on logarithmic scales.
Figure 15
Figure 15
Reverse cumulative distribution of the lengths (base-pairs along the genome) of TSRs for high-CpG (red curve) and low-CpG (green curve) promoters. The horizontal axis is shown on a logarithmic scale.

References

    1. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-seq. Nat Methods. 2008;5:621–628. doi: 10.1038/nmeth.1226. - DOI - PubMed
    1. Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Snyder M. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science. 2008;320:1344–1349. doi: 10.1126/science.1158441. - DOI - PMC - PubMed
    1. Wilhelm BT, Marguerat S, Watt S, Schuber F, Wood V, Goodhead I, Penkett CJ, Rogers J, Bahler J. Dynamic repertoire of a eukaryotic transcriptome surveyed at single-nucleotide resolution. Nature. 2008;453:1239–1245. doi: 10.1038/nature07002. - DOI - PubMed
    1. Rosenkranz R, Borodina T, Lehrach H, Himmelbauer H. Characterizing the mouse ES cell transcriptome with Illumina sequencing. Genomics. 2008;92:187–194. doi: 10.1016/j.ygeno.2008.05.011. - DOI - PubMed
    1. Maeda N, Nishiyori H, Nakamura M, Kawazu C, Murata M, Sano H, Hayashida K, Fukuda S, Tagami M, Hasegawa A, Murakami K, Schroder K, Hume KID, Hayashizaki Y, Carninci P, Suzuki H. Development of a DNA barcode tagging method for monitoring dynamic changes in gene expression by using an ultra high-throughput sequencer. Biotechniques. 2008;45:95–97. doi: 10.2144/000112814. - DOI - PubMed

Publication types

MeSH terms