Nat Methods. 2014 Jun;11(6):689-94.
doi: 10.1038/nmeth.2924. Epub 2014 Apr 13.

Multiscale representation of genomic signals

Theo A Knijnenburg et al. Nat Methods. 2014 Jun.

Abstract

Genomic information is encoded on a wide range of distance scales, ranging from tens of bases to megabases. We developed a multiscale framework to analyze and visualize the information content of genomic signals. Different types of signals, such as G+C content or DNA methylation, are characterized by distinct patterns of signal enrichment or depletion across scales spanning several orders of magnitude. These patterns are associated with a variety of genomic annotations. By integrating the information across all scales, we demonstrated improved prediction of gene expression from polymerase II chromatin immunoprecipitation sequencing (ChIP-seq) measurements, and we observed that gene expression differences in colorectal cancer are related to methylation patterns that extend beyond the single-gene scale. Our software is available at https://github.com/tknijnen/msr/.

Figures

Figure 1
Four-step procedure for the multiscale segmentation of genomic signals. The depicted genomic signal is a part of a Pol II ChIP-seq signal derived from primary murine bone marrow macrophage cells after 1 hour of lipopolysaccharide stimulation, mapped to genome assembly mm9. The genomic coordinates are in Mb. (a) Smoothing of the genomic signal at different scales results in the Gaussian scale space. The scale space is represented as a heatmap below the original signal. (b) A segmentation tree is created by propagating nodes from the smallest scale to the largest scale. This tree is visualized by the white lines. The three pairs of green and black branches are examples of how to derive segments from the tree, as is explained in the text. (c) Segments at multiple scales are derived from the segmentation tree. The three green and black rectangles (segments) are derived from the green and black pairs of branches in b. The different segments are colored according to the (log2 transformed) fold change between observed and expected signal intensity within the segments. (d) The segments are scored for enrichment using a statistical testing procedure that outputs the fold change that is statistically significant at a predefined confidence level (‘Significant Fold Change’, SFC).
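Step (a) of the caption can be illustrated with a minimal sketch of a Gaussian scale space: the same signal is smoothed with Gaussian kernels of increasing width, and the smoothed copies are stacked into a matrix with one row per scale. The function names, the geometric spacing of the kernel widths, the 3-sigma kernel truncation and the edge padding are all assumptions made for illustration; they are not taken from the authors' software.

```python
import numpy as np

def gaussian_kernel(sigma):
    """Discrete Gaussian kernel truncated at 3 sigma, normalized to sum to 1."""
    radius = max(1, int(np.ceil(3 * sigma)))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    return kernel / kernel.sum()

def scale_space(signal, sigmas):
    """Return a (len(sigmas), len(signal)) matrix of smoothed signals."""
    signal = np.asarray(signal, dtype=float)
    rows = []
    for sigma in sigmas:
        kernel = gaussian_kernel(sigma)
        radius = (len(kernel) - 1) // 2
        # Pad with edge values so every output row keeps the input length.
        padded = np.pad(signal, radius, mode="edge")
        rows.append(np.convolve(padded, kernel, mode="valid"))
    return np.vstack(rows)

# Synthetic stand-in for a read-count signal (not real ChIP-seq data).
rng = np.random.default_rng(0)
signal = rng.poisson(5.0, size=500).astype(float)

# Hypothetical choice: kernel widths grow geometrically with the scale index.
sigmas = [1.5 ** s for s in range(8)]
space = scale_space(signal, sigmas)
```

Plotting `space` as a heatmap below `signal` would give a picture analogous to panel a, with fine detail in the first rows and only broad trends in the last rows.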
Figure 2
Significant fold change (SFC) of segments across multiple scales for different ChIP targets, conservation scores and GC content. Each of the six panels displays the results of a genome-wide multiscale segmentation analysis. In each panel, the heat map diagram shows a two-dimensional histogram that is created by binning the segments based on their scale (x-axis) and on their SFC (y-axis). The color indicates the number of segments. The three panels on the left represent specific ChIP targets; their two-dimensional histograms are averages across multiple histograms derived from ChIP-seq experiments performed under different experimental conditions. The upper-right histogram is an average across the histograms corresponding to the ChIP experiments of three TFs: ATF3, p50 and p65 (also based on multiple experimental conditions). The histograms were averaged, as they were very similar for the three TFs. The middle and lower-right panels represent the multiscale signatures derived from the 29-way vertebrate conservation scores and the GC content signal, respectively. At the bottom of the figure, the median (black line) and interquartile range (grey fill) of the segment sizes at different scales are shown. The median segment sizes for scales 5,10,…,45 are stated between the heat maps.
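The binning described in this caption can be sketched as follows: each segment contributes one count to the cell given by its scale and its SFC, and histograms from several experiments are averaged cell by cell. The segment values below are synthetic, and the function name, the SFC range and the number of SFC bins are assumptions made for illustration.

```python
import numpy as np

def scale_sfc_histogram(scales, sfc, n_scales=50, sfc_range=(-4.0, 4.0), n_sfc_bins=40):
    """Count segments per (scale, SFC) cell; rows are scales, columns are SFC bins."""
    lo, hi = sfc_range
    # One bin per integer scale 1..n_scales.
    scale_edges = np.arange(0.5, n_scales + 1.0, 1.0)
    sfc_edges = np.linspace(lo, hi, n_sfc_bins + 1)
    # Clip extreme SFC values into the outermost bins so no segment is dropped.
    clipped = np.clip(sfc, lo, hi)
    counts, _, _ = np.histogram2d(scales, clipped, bins=[scale_edges, sfc_edges])
    return counts

# Two synthetic "experiments", each with a scale and an SFC per segment.
rng = np.random.default_rng(1)
n_segments = 2000
histograms = []
for _ in range(2):
    seg_scale = rng.integers(1, 51, size=n_segments)    # scale index 1..50
    seg_sfc = rng.normal(0.0, 1.0, size=n_segments)     # placeholder SFC values
    histograms.append(scale_sfc_histogram(seg_scale, seg_sfc))

# Cell-wise average across experiments, as done for the averaged panels.
mean_histogram = np.mean(histograms, axis=0)
```

Displaying `mean_histogram.T` as a heatmap puts the scale on the x-axis and the SFC on the y-axis, matching the layout described above.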
Figure 3
Overlap between functional genomic regions and the segments comprising the MSRs of genomic signals. (a,b) The heatmaps depict the degree of overlap between genomic regions and enriched segments (SFC>0) of the conservation (in a) and GC content (in b) genomic signals. The genomic regions are shown on the left of the heatmaps. Positive scores represent a larger overlap than expected by chance; negative scores represent a smaller overlap. A grey color indicates that fewer than ten enriched segments were found at that scale, in which case the overlap score was not computed. The top panel depicts the median (black line) and interquartile range (grey fill) of the segment sizes across the 50 scales. The median segment sizes for scales 5,10,…,45 are stated between the heat maps.
Figure 4
DNA methylation MSR of differentially expressed colorectal cancer genes. (a) A two-dimensional histogram created by binning the segments based on their scale and on their differential DNA methylation score (DM) between a primary human colorectal tumor and adjacent normal colon tissue. The color indicates the number of segments. The x-axis and y-axis represent the scale and the DM, respectively. Positive DM scores indicate hypermethylation in the tumor, whereas negative DM scores represent hypomethylation. DM scores in the bin around zero were removed. This figure is created similarly to Fig. 2. (b) Four groups of genes were created based on the differential expression between tumor and normal: 1) strongly upregulated genes have at least 1 unit more expression in the tumor than in the normal tissue; 2) strongly downregulated genes have at least 1 unit less expression; 3) moderately upregulated genes have between 0.1 and 1 unit higher expression in the tumor; and 4) moderately downregulated genes have between 0.1 and 1 unit lower expression. (A difference of 1 unit corresponds to a doubling or halving for these log2-transformed gene expression values.) These groups were compared with the genes that had the largest hyper- or hypomethylation values around the TSS at a particular scale. The heatmaps depict the P-values of the hypergeometric overlap test, where red indicates significant overlap (enrichment), blue indicates a significant lack of overlap (depletion) and white represents the randomly expected overlap. (c) Similar to b, but for segments overlapping the gene middle (GM).
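The hypergeometric overlap test mentioned in panels b and c can be sketched with the standard library alone. Given a universe of N genes, a group of K differentially expressed genes and a group of n strongly methylated genes sharing k members, the enrichment P-value is the probability of k or more shared genes by chance and the depletion P-value is the probability of k or fewer. The function name and the gene counts are hypothetical; the caption does not state how the authors' software computes or combines the two tails.

```python
from math import comb

def overlap_pvalues(N, K, n, k):
    """Return (enrichment, depletion) hypergeometric tail probabilities.

    N: genes in the universe; K: genes in the expression group;
    n: genes in the methylation group; k: genes shared by both groups.
    """
    lowest = max(0, n - (N - K))   # smallest overlap that is possible
    highest = min(K, n)            # largest overlap that is possible
    total = comb(N, n)
    pmf = {i: comb(K, i) * comb(N - K, n - i) / total
           for i in range(lowest, highest + 1)}
    enrichment = sum(p for i, p in pmf.items() if i >= k)
    depletion = sum(p for i, p in pmf.items() if i <= k)
    return enrichment, depletion

# Hypothetical counts: 1000 genes, 100 upregulated, 50 hypermethylated, 15 shared.
# The overlap expected by chance is 50 * 100 / 1000 = 5 genes.
p_enriched, p_depleted = overlap_pvalues(N=1000, K=100, n=50, k=15)

# A 1-unit difference in log2 expression is a twofold change, as the caption notes.
fold_change = 2 ** 1
```

With these counts the observed overlap is three times the chance expectation, so `p_enriched` is small and `p_depleted` is close to 1; such a cell would be drawn in red in the heatmap.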
