Elife. 2024 Apr 3:12:RP89548.
doi: 10.7554/eLife.89548.

Integrative analysis of DNA replication origins and ORC-/MCM-binding sites in human cells reveals a lack of overlap

Mengxue Tian et al. Elife.

Abstract

Based on experimentally determined average inter-origin distances of ~100 kb, DNA replication initiates from ~50,000 origins on human chromosomes in each cell cycle. The origins are believed to be specified by binding of factors like the origin recognition complex (ORC) or CTCF or other features like G-quadruplexes. We have performed an integrative analysis of 113 genome-wide human origin profiles (from five different techniques) and five ORC-binding profiles to critically evaluate whether the most reproducible origins are specified by these features. Out of ~7.5 million union origins identified by all datasets, only 0.27% (20,250 shared origins) were reproducibly obtained in at least 20 independent SNS-seq datasets and contained in initiation zones identified by each of three other techniques, suggesting extensive variability in origin usage and identification. Also, 21% of the shared origins overlap with transcriptional promoters, posing a conundrum. Although the shared origins overlap more than union origins with constitutive CTCF-binding sites, G-quadruplex sites, and activating histone marks, these overlaps are comparable to or less than those of known transcription start sites, so that these features could be enriched in origins because of the overlap of origins with epigenetically open, promoter-like sequences. Only 6.4% of the 20,250 shared origins were within 1 kb of any of the ~13,000 reproducible ORC-binding sites in human cancer cells, and only 4.5% were within 1 kb of the ~11,000 union MCM2-7-binding sites, in contrast to the nearly 100% overlap in the two comparisons in the yeast Saccharomyces cerevisiae. Thus, in human cancer cell lines, replication origins appear to be specified by highly variable stochastic events dependent on the high epigenetic accessibility around promoters, without extensive overlap between the most reproducible origins and currently known ORC- or MCM-binding sites.
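
The percentages quoted in the abstract can be checked with simple arithmetic. The sketch below is illustrative only: it uses the counts given in the abstract and in the Figure 1 caption, and the derived numbers of origins near ORC or MCM sites are approximations, because the published percentages are rounded. The helper name `percent` is a hypothetical choice for this example.

```python
# Counts quoted in the abstract and the Figure 1 caption.
UNION_ORIGINS = 7_459_709
SHARED_ORIGINS = 20_250


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Share of union origins that qualify as shared origins.
shared_pct = percent(SHARED_ORIGINS, UNION_ORIGINS)

# Approximate origin counts implied by the rounded percentages in the abstract.
near_orc = round(SHARED_ORIGINS * 6.4 / 100)  # within 1 kb of an ORC-binding site
near_mcm = round(SHARED_ORIGINS * 4.5 / 100)  # within 1 kb of an MCM2-7-binding site

print(f"shared / union = {shared_pct:.2f}%")
print(f"approx. shared origins near ORC sites: {near_orc}")
print(f"approx. shared origins near MCM sites: {near_mcm}")
```

On these rounded figures, only about 1,300 and 900 of the 20,250 shared origins lie near ORC- and MCM-binding sites, respectively.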

Keywords: DNA replication; MCM2-7; ORC; chromosomes; gene expression; human; integrative analysis; origins of replication.

Conflict of interest statement

MT, ZW, ZS, ES, YS, AD, CZ: No competing interests declared.

Figures

Figure 1. A total of 7,459,709 origins defined by four types of techniques show different genomic features.
(a) Data processing pipeline. A total of 113 publicly available profiles of origins are processed following the pipeline. (b) Number of samples collected for each technique. In total, 7,459,709 union origins were identified. (c) Principal component analysis (PCA) shows the clustering of origin datasets from different techniques. (d) Genomic annotation (transcription start sites [TSS], exon, intron, and intergenic regions) of different groups of origins. Background is the percentage of each annotation on the whole genome. (e) Overlap with TF hotspots for different groups of origins and promoters. (f) Overlap with constitutive CTCF-binding sites for different groups of origins and promoters. (g) GC content of different groups of origins and promoters. Gray line marks the average GC content of the human genome. (h) G-quadruplex overlapping rates of different groups of origins and promoters.
Figure 1—figure supplement 1. Distribution of origins defined by four types of techniques.
(a) Number of samples collected for each cell type. (b) Distribution of peak lengths of origins from each technique. (c) Principal component analysis (PCA) results of all samples, marked by cell types. (d) PCA results of short nascent strand-seq (SNS-seq) samples, marked by cell types. (e) PCA results of SNS-seq samples, marked by the year the data were uploaded. (f) BART2 results of union origins. (g) Enrichment of histone marks at re-replicated union origins using total union origins as control.
Figure 1—figure supplement 2. Correlation between origins from different samples.
Pairwise correlation of samples from different techniques.
Figure 2. The shared origins are enriched with certain transcription factors and active histone marks.
(a) Short nascent strand-seq (SNS-seq) origin-fitting distribution to an exponential model shows an occupancy score ≥20 is selected for reproducible SNS-seq origins. (b) Conceptual model of how the shared origins are determined. Any SNS-seq shared origin that overlaps with Bubble-seq initiation zone (IZ), Okazaki-seq (OK-seq) IZ, and Repli-seq origin together is considered as an origin identified by all four techniques (shared origins). (c) Genomic annotation of union origins and shared origins. (d) Overlap with TF hotspots of union origins and shared origins. (e) Overlap with constitutive CTCF-binding sites of union origins and shared origins. (f) GC content of union origins and shared origins. (g) G-quadruplex overlapping rates of union origins and shared origins. (h) BART prediction of TFs associated with shared origins. (i) Enrichment of histone marks at shared origins using all union origins as control.
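
The selection rule described in panels (a) and (b) can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the `Interval` type, the function names, the half-open coordinate convention, and the toy coordinates are all assumptions made for the example. It keeps an SNS-seq origin only if its occupancy score is at least 20 and it overlaps an interval from each of the three other techniques.

```python
from collections import namedtuple

# Hypothetical interval type; coordinates are assumed half-open [start, end).
Interval = namedtuple("Interval", "chrom start end")


def overlaps(a, b):
    """True if two intervals on the same chromosome share at least 1 bp."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def shared_origins(sns_origins, occupancy, zone_sets, min_occupancy=20):
    """Keep SNS-seq origins that pass the occupancy cutoff and overlap
    at least one interval in every set of zones (one set per technique)."""
    kept = []
    for origin in sns_origins:
        if occupancy[origin] < min_occupancy:
            continue  # not reproducible enough across SNS-seq samples
        if all(any(overlaps(origin, z) for z in zones) for zones in zone_sets):
            kept.append(origin)
    return kept


# Toy data (invented for illustration).
sns = [Interval("chr1", 100, 400), Interval("chr1", 5000, 5300), Interval("chr2", 100, 400)]
occupancy = {sns[0]: 25, sns[1]: 30, sns[2]: 5}
bubble_iz = [Interval("chr1", 0, 1000), Interval("chr1", 4000, 6000), Interval("chr2", 0, 1000)]
okseq_iz = [Interval("chr1", 0, 2000), Interval("chr2", 0, 1000)]
repli_origins = [Interval("chr1", 0, 10000), Interval("chr2", 0, 1000)]

result = shared_origins(sns, occupancy, [bubble_iz, okseq_iz, repli_origins])
print(result)
```

In the toy data, the second origin is dropped because no OK-seq zone covers it, and the third because its occupancy score is below the cutoff. The nested scan is quadratic, so a genome-scale analysis would need a sorted or indexed interval method instead.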
Figure 2—figure supplement 1. Background model for the identification of the shared origins.
(a) Conceptual model of how the occupancy score is defined as the number of samples in which each origin occurs. (b) Distribution of occupancy scores of short nascent strand-seq (SNS-seq) union origins (300 bp).
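
The occupancy score lends itself to a short sketch. The code below is a hedged illustration under stated assumptions: intervals are plain `(chrom, start, end)` tuples with half-open coordinates, an origin "occurs" in a sample if any peak in that sample overlaps it by at least 1 bp, and each sample counts at most once per origin. The function names and toy data are invented; how the authors construct the 300 bp union origins is not described in this caption.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) intervals share at least 1 bp."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def occupancy_scores(union_origins, samples):
    """Map each union origin to the number of samples containing
    at least one overlapping peak (each sample counts once)."""
    scores = {}
    for origin in union_origins:
        scores[origin] = sum(
            1 for peaks in samples if any(overlaps(origin, p) for p in peaks)
        )
    return scores


# Toy data (invented): two 300 bp union origins and three samples.
union = [("chr1", 0, 300), ("chr1", 300, 600)]
samples = [
    [("chr1", 100, 200)],  # overlaps only the first origin
    [("chr1", 250, 350)],  # straddles both origins
    [("chr2", 0, 300)],    # different chromosome, overlaps neither
]

scores = occupancy_scores(union, samples)
print(scores)
```

An origin would then be called reproducible when its score meets the chosen cutoff, which Figure 2a gives as 20.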
Figure 2—figure supplement 2. Origins/IZs defined by different techniques: (a) SNS-seq, (b) OK-seq and (c) Repli-seq in K562 cell line and their overlap with the shared origins.
Shared origins are defined from all samples. The number of shared origins covered by each file is calculated and marked in the figure. Numbers in the parentheses are the number of peaks in the other dataset that overlap with the shared origins.
Figure 3. Genomic features of the shared origin recognition complex (ORC)-binding sites and their co-localization with the shared origins.
(a) Genomic annotation of union ORC and shared ORC-binding sites. (b) Overlap with TF hotspot of union ORC and shared ORC-binding sites. (c) Overlap with constitutive CTCF-binding sites of union ORC and shared ORC-binding sites. (d) GC content of union ORC and shared ORC-binding sites. (e) Overlap with G-quadruplex of union ORC and shared ORC-binding sites. (f) The percentage of high-confidence origins (shared origins in humans and confirmed origins in yeast) that overlapped with (left) or are proximate to (≤1 kb) (right) two types of ORC-binding sites (union or shared). (g) Distribution of the distance between ORC-binding sites and the nearest shared origin.
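
Panels (f) and (g) rest on two quantities: the distance from one interval to the nearest interval of another set, and the fraction of origins lying within 1 kb of a binding site. The sketch below shows one way to compute both. It is an assumption-laden illustration rather than the authors' method: intervals are `(chrom, start, end)` tuples with half-open coordinates, distance is measured edge to edge, and the function names and toy coordinates are invented.

```python
def gap(a, b):
    """Edge-to-edge distance in bp between two (start, end) intervals;
    0 if they overlap or touch."""
    return max(0, a[0] - b[1], b[0] - a[1])


def nearest_distance(query, targets):
    """Distance from a (chrom, start, end) query to the nearest target on
    the same chromosome, or None if that chromosome has no targets."""
    same_chrom = [t for t in targets if t[0] == query[0]]
    if not same_chrom:
        return None
    return min(gap(query[1:], t[1:]) for t in same_chrom)


def fraction_near(origins, sites, max_gap=1000):
    """Fraction of origins within max_gap bp of any binding site."""
    near = 0
    for origin in origins:
        d = nearest_distance(origin, sites)
        if d is not None and d <= max_gap:
            near += 1
    return near / len(origins)


# Toy data (invented for illustration).
origins = [("chr1", 1000, 1300), ("chr1", 50000, 50300), ("chr2", 100, 400)]
orc_sites = [("chr1", 2000, 2200), ("chr2", 5000, 5200)]

print(nearest_distance(orc_sites[0], origins))
print(fraction_near(origins, orc_sites))
```

Setting `max_gap=0` restricts the count to origins that overlap or abut a site, which roughly corresponds to the "overlapped with" comparison in the left half of panel (f).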
Figure 3—figure supplement 1. Analysis of overlap between the shared origin recognition complex (ORC)-binding sites and origins.
(a) A total of 12,712 ORC-binding sites in the human genome were shared by at least two ORC ChIP-seq datasets. The overlapping rates with the shared origins are provided. (b) Overlap of union origins, MCM3-7, and ORC2 in the K562 cell line. (c) Overlap of union origins and MCM2 in the HCT116 cell line. (d) Overlap of union origins and ORC1 in the HeLa cell line. (e) Overlap of shared origins seen in K562 cells with ORC- and MCM-binding sites in K562 cells. Shared origins seen in K562 cells were generated from short nascent strand-seq (SNS-seq) files that overlapped with K562 initiation zones (IZs) (defined by Okazaki-seq [OK-seq] and Repli-seq). (f) Overlap of shared origins seen in HeLa cells with ORC-binding sites in HeLa cells. The shared origins seen in HeLa cells were generated from three HeLa-derived SNS-seq samples using the intersected peaks from NS_GSM3983205_hela_siNC.bed, NS_GSM3983206_hela_siNC.bed, NS_GSM3983210_hela_siH2A.Z.bed.
Figure 4. Shared origins near shared origin recognition complex (ORC)-binding sites are more correlated with active transcription.
(a) Genomic annotation of shared origins and shared origins near (≤1 kb) the ORC-binding sites. (b) Overlap with TF hotspots of shared origins and shared origins near the ORC-binding sites. (c) Overlap with constitutive CTCF-binding sites of shared origins and shared origins near the ORC-binding sites. (d) GC content of shared origins and shared origins near the ORC-binding sites. (e) Overlap with G-quadruplex sites of shared origins and shared origins near the ORC-binding sites. (f) Y-axis: replication timing score from Navarro Gonzalez et al., 2021 for the indicated classes of origins. (g) Annotation of expression level of genes that overlapped with different groups of origins. (h) BART prediction of TFs associated with the highest confidence origins.
Figure 5. Genomic features of the shared minichromosome maintenance complex (MCM)-binding sites and their co-localization with the shared origins.
(a) Genomic annotation of union MCM and shared MCM-binding sites. (b) Overlap with TF hotspot of union MCM and shared MCM-binding sites. (c) Overlap with constitutive CTCF-binding sites of union MCM and shared MCM-binding sites. (d) GC content of union MCM and shared MCM-binding sites. (e) Overlap with G-quadruplex of union MCM and shared MCM-binding sites. (f) Overlap with TF hotspots of shared origins and shared origins near the MCM-binding sites. (g) The percentage of high-confidence origins (shared origins in humans and confirmed origins in yeast) that overlapped with (left) or are proximate to (≤1 kb) (right) two types of MCM-binding sites (union or shared). (h) Venn diagram of the shared origins that are near the ORC-, MCM2-, or MCM3-7-binding sites.
Figure 5—figure supplement 1. Analyses of a few selected origin sets suggested by the reviewers.
Figure 5—figure supplement 2. Origin recognition complex (ORC) subunits do not co-bind to DNA as much as expected.
Figure 5—figure supplement 3. Shared origins overlap with phosphorylated MCM2.
(a) Percentage of shared origins that overlap with phosphorylated MCM2-binding sites. (b) Percentage of shared origins that are near phosphorylated MCM2-binding sites.
Figure 5—figure supplement 4. Selecting fewer but even more reproducible origins with more stringent cutoff to determine their overlap with the origin recognition complex (ORC) and minichromosome maintenance complex (MCM)-binding sites.
(a) The percentage of high-confidence origins (defined by the cutoff of occupancy score indicated on the X-axis) that overlapped with the union or shared ORC-binding sites. (b) Similar to (a), except the percentage of origins that are near (<1 kb) the ORC-binding sites. (c) Similar to (a) except the percentage of origins that overlap with the union or shared MCM-binding sites. (d) Similar to (c), except the percentage of origins near (<1 kb) the MCM-binding sites.
Figure 6. Genome browser screenshots for 3 of the 74 origins from Figure 5h.
The numbers below the short nascent strand-seq (SNS-seq) shared origins track are the occupancy score of the origins along the length of the indicated track.
