Genome Res. 2017 Nov;27(11):1939-1949.
doi: 10.1101/gr.220640.117. Epub 2017 Aug 30.

HiCRep: assessing the reproducibility of Hi-C data using a stratum-adjusted correlation coefficient

Tao Yang et al. Genome Res. 2017 Nov.

Abstract

Hi-C is a powerful technology for studying genome-wide chromatin interactions. However, current methods for assessing Hi-C data reproducibility can produce misleading results because they ignore spatial features in Hi-C data, such as domain structure and distance dependence. We present HiCRep, a framework for assessing the reproducibility of Hi-C data that systematically accounts for these features. In particular, we introduce a novel similarity measure, the stratum-adjusted correlation coefficient (SCC), for quantifying the similarity between Hi-C interaction matrices. Besides providing a statistically sound and reliable evaluation of reproducibility, SCC can also be used to quantify differences between Hi-C contact matrices and to determine the optimal sequencing depth for a desired resolution. The measure consistently shows higher accuracy than existing approaches in distinguishing subtle differences in reproducibility and depicting interrelationships of cell lineages. The proposed measure is straightforward to interpret and easy to compute, making it well-suited for providing standardized, interpretable, automatable, and scalable quality control. The freely available R package HiCRep implements our approach.
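The idea behind a stratum-adjusted correlation can be illustrated with a short Python sketch. It is not the HiCRep implementation: the function names (`smooth`, `strata`, `scc_sketch`), the default parameters, the choice to start at bin distance 1, and the weights (stratum size times the geometric mean of the two raw variances) are all assumptions made for illustration, and the published method's exact weighting and any rank transformation are not reproduced here. The sketch smooths each contact matrix with a mean filter, groups entries by genomic distance (one stratum per diagonal), computes a Pearson correlation within each stratum, and returns a weighted average of those correlations.

```python
from math import sqrt
from statistics import fmean


def smooth(matrix, h):
    """Mean-filter a square matrix over a (2h+1) x (2h+1) window, clipped at the edges."""
    n = len(matrix)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            window = [matrix[a][b]
                      for a in range(max(0, i - h), min(n, i + h + 1))
                      for b in range(max(0, j - h), min(n, j + h + 1))]
            out[i][j] = fmean(window)
    return out


def strata(matrix, max_dist):
    """Group upper-triangle entries by bin distance k = j - i, for k = 1..max_dist."""
    n = len(matrix)
    return {k: [matrix[i][i + k] for i in range(n - k)]
            for k in range(1, min(max_dist, n - 1) + 1)}


def scc_sketch(x, y, h=1, max_dist=4):
    """Weighted average of per-stratum Pearson correlations (simplified, assumed weights)."""
    sx = strata(smooth(x, h), max_dist)
    sy = strata(smooth(y, h), max_dist)
    num = den = 0.0
    for k, a in sx.items():
        b = sy[k]
        if len(a) < 2:
            continue
        ma, mb = fmean(a), fmean(b)
        var_a = fmean([(u - ma) ** 2 for u in a])
        var_b = fmean([(v - mb) ** 2 for v in b])
        spread = sqrt(var_a * var_b)
        if spread == 0:
            continue  # a constant stratum has no defined correlation
        cov = fmean([(u - ma) * (v - mb) for u, v in zip(a, b)])
        weight = len(a) * spread  # assumption: stratum size times variance term
        num += weight * (cov / spread)
        den += weight
    return num / den if den else float("nan")


# Toy symmetric 6 x 6 "contact matrix" (made-up values, not real Hi-C data).
n = 6
m = [[((i + 1) * (j + 1)) % 7 + 1.0 / (1 + abs(i - j)) for j in range(n)]
     for i in range(n)]
neg = [[-v for v in row] for row in m]

same = scc_sketch(m, m)        # identical matrices
opposite = scc_sketch(m, neg)  # sign-flipped matrix
```

Because correlations are computed within each distance stratum, the strong decay of contact frequency with genomic distance cannot by itself inflate the score, which is the spatial feature the abstract says plain correlation ignores.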

Figures

Figure 1.
An illustration example. (A) Hi-C contact maps of the biological replicates of hESC and IMR90. (B) Relationship between genomic distance and the average contact frequency for the samples in A. Data are from Chromosome 22: 32000000–40000000.
Figure 2.
A schematic representation of our method. (A) Step 1: smoothing; (B) Step 2: stratification.
Figure 3.
Discrimination of pseudoreplicates (PR), biological replicates (BR), and nonreplicates (NR). (A) Reproducibility scores for the illustration example (hESC and IMR90 cell lines) in Figure 1. Red dots are the results in the original samples, and blue dots are the results after equalizing the sequencing depth in all samples. (B,C) Reproducibility scores for the BR and NR in the ENCODE 11 cancer cell lines. The triangle represents the score for a BR, and the box plot represents the distribution of the scores for NRs. (B) Reproducibility scores for BRs and NRs in all cell types. (C) SCC for BRs and the corresponding NRs in each cell type. From left to right, the cell lines are ordered according to the average sequencing depths of the biological replicates.
Figure 4.
Estimating interrelationship between the 10 samples in the human H1 ESC lineage. (A) The heatmap and lineage relationship between the ES cell and its five derived cells based on A/B compartments in Hi-C data (Dixon et al. 2015) and RNA-seq data (Xie et al. 2013). (B–D) Estimated interrelationship based on the pairwise similarity score calculated using SCC (B), Pearson correlation (C), and Spearman correlation (D). Heatmaps show the similarity scores. Dendrograms resulted from a hierarchical clustering analysis based on the similarity scores. For easy visualization, the cell lines in the heatmaps are ordered according to their known distances to ES cells in A. A decreasing trend of scores is expected from left to right (from bottom to top, respectively) if the estimated interrelationship agrees with the known lineage.
Figure 5.
Estimated interrelationship for 14 human primary tissues and two cell lines in Schmitt et al. (2016). The dendrograms result from a hierarchical clustering analysis based on the pairwise similarity calculated using SCC (A), Pearson correlation (B), and Spearman correlation (C).
Figure 6.
Estimated similarity between the human H1 ES cell and its derived cells at different resolutions. (A) SCC; (B) Pearson correlation coefficient; and (C) Spearman correlation coefficient.
Figure 7.
Detecting the change of reproducibility due to sequencing depth using SCC. (A) SCC of downsampled biological replicates (25%, 50%, 75%, and 100% of the original sequencing depth) for the five cell lines on the H1 ES cell lineage. (B) Relationship between SCC and Jaccard index, which measures the proportion of shared significant contacts identified by Fit-Hi-C between replicates for samples in A. (C) Saturation curves of SCC for data sets with different coverages. The SCC is plotted at different subsamples (10%–90%) of the original samples with 90% confidence intervals. The blue dots represent H1 human ESC data (original sequencing depth = 500 M). The red dots represent the A549 data (original sequencing depth = 30 M).
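The Jaccard index mentioned in the Figure 7 caption can be sketched in a few lines of Python. This uses the standard definition (size of the intersection divided by size of the union); the function name `jaccard_index`, the representation of a contact as a pair of bin indices, and the example contact sets are hypothetical and are not taken from the paper or from Fit-Hi-C output.

```python
def jaccard_index(contacts_a, contacts_b):
    """Proportion of contacts shared by two replicates: |A & B| / |A | B|."""
    a, b = set(contacts_a), set(contacts_b)
    union = a | b
    if not union:
        return float("nan")  # undefined when neither replicate has any contact
    return len(a & b) / len(union)


# Made-up significant contacts, each written as a (bin_i, bin_j) pair.
rep1 = {(10, 25), (10, 40), (31, 57)}
rep2 = {(10, 25), (31, 57), (44, 90)}

shared = jaccard_index(rep1, rep2)  # 2 shared contacts out of 4 distinct -> 0.5
```

A value near 1 means the two replicates yield almost the same set of significant contacts, while a value near 0 means they share few.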
