Proc Natl Acad Sci U S A. 2016 Sep 27;113(39):E5765-74.
doi: 10.1073/pnas.1603241113. Epub 2016 Sep 14.

Inevitability and containment of replication errors for eukaryotic genome lengths spanning megabase to gigabase

Mohammed Al Mamun et al. Proc Natl Acad Sci U S A.

Abstract

The replication of DNA is initiated at particular sites on the genome called replication origins (ROs). Understanding the constraints that regulate the distribution of ROs across different organisms is fundamental for quantifying the degree of replication errors and their downstream consequences. Using a simple probabilistic model, we generate a set of predictions on the extreme sensitivity of error rates to the distribution of ROs, and how this distribution must therefore be tuned for genomes of vastly different sizes. As genome size changes from megabases to gigabases, we predict that regularity of RO spacing is lost, that large gaps between ROs dominate error rates but are heavily constrained by the mean stalling distance of replication forks, and that, for genomes spanning ∼100 megabases to ∼10 gigabases, errors become increasingly inevitable but their number remains very small (three or less). Our theory predicts that the number of errors becomes significantly higher for genome sizes greater than ∼10 gigabases. We test these predictions against datasets in yeast, Arabidopsis, Drosophila, and human, and also through direct experimentation on two different human cell lines. Agreement of theoretical predictions with experiment and datasets is found in all cases, resulting in a picture of great simplicity, whereby the density and positioning of ROs explain the replication error rates for the entire range of eukaryotes for which data are available. The theory highlights three domains of error rates: negligible (yeast), tolerable (metazoan), and high (some plants), with the human genome at the extreme end of the middle domain.

Keywords: Poisson distribution; eukaryotes; genome length; mathematical modeling; replication error.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Fig. 1.
Potential outcomes arising from ROs licensed on a DNA segment. DNA is denoted as a single black line. Before S-phase entry, four origins (denoted by I, II, III, and IV) are licensed by binding a double hexamer of MCM2-7 proteins (blue). As an origin fires, both MCM2-7 single hexamers are converted into an active Cdc45, Mcm2-7, and GINS complex helicase (pink). (A) RO II is dormant and passively replicated by the fork coming from RO I; replication is complete. (B) Red crosses depict fork stalling. The previously dormant RO II fires to complete the replication of DNA between stalled forks. However, as there is no RO licensed between RO III and IV, the DNA between the two stalled forks in this region remains unreplicated, and complete replication is compromised. Adapted from ref. .
Fig. S1.
Inter-RO distances from the Besnard et al. (B) and Picard et al. (P) datasets are plotted. Due to the difference in detection resolution, the minimum inter-RO distance is 4001 bp in the Picard et al. data and 240 bp in the Besnard et al. data. The overlapping bar charts show that the two datasets are compatible. The compatibility of the two datasets is discussed in more detail in ref. .
Fig. 2.
Schematic of the central equation. The genome length is the dominant contributor to the overall replication error due to fork-stalling, followed by the number of licensed ROs and, lastly, by their distribution.
Fig. 3.
(A) Predicted probability of one or more DFSs for various eukaryotic genomes using the central equation from the model. (B) Measured mean replicon length across the same genomes from the corresponding experimental datasets. (C) Computed R values from the same eukaryotic datasets; note that the dashed bars represent simulated R values for virtual genomes of the same length and RO density but assuming ROs to be randomly distributed. (D) The probability of a DFS, denoted P(DFS), is plotted as a function of increasing replicon length. The estimated median fork-stalling distance, Ns (10 Mbp), is highlighted on the x axis. P(DFS) starts to increase sharply as soon as the replicon size reaches approximately half the value of Ns; note that the x axis has a log scale. (E) The calculated probability of a DFS inside replicons plotted against normalized chromosomal lengths for the largest chromosomes in budding yeast, Drosophila, Arabidopsis, and the IMR90 cell line from two human datasets (B and P).
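The dependence of P(DFS) on replicon length in panel D can be illustrated with a minimal sketch. It is not the paper's central equation, which is not reproduced on this page. It assumes that each fork stalls with a constant per-base hazard chosen so that the median stalling distance equals Ns, and that a replicon fails only when the two converging forks both stall before meeting. The function names `p_dfs` and `p_any_dfs` are hypothetical.

```python
import math


def p_dfs(replicon_bp: float, ns_bp: float = 10e6) -> float:
    """Probability of a double fork stall (DFS) in one replicon.

    Assumption: each fork travels an exponentially distributed distance
    before stalling, with median ns_bp, so the hazard is ln(2) / ns_bp.
    The replicon fails if the two distances sum to less than its length.
    """
    x = math.log(2) / ns_bp * replicon_bp
    # Sum of two exponentials is Gamma(2); this is its CDF at replicon_bp.
    return 1.0 - (1.0 + x) * math.exp(-x)


def p_any_dfs(replicons_bp, ns_bp: float = 10e6) -> float:
    """Probability of one or more DFSs over independent replicons."""
    mu = math.log(2) / ns_bp
    # Work with log survival, log((1 + x) e^{-x}), to stay stable for long replicons.
    log_ok = sum(math.log1p(mu * n) - mu * n for n in replicons_bp)
    return 1.0 - math.exp(log_ok)


if __name__ == "__main__":
    for n in (1e4, 1e5, 1e6, 5e6, 1e7):
        print(f"{n:>12.0f} bp  P(DFS) = {p_dfs(n):.3e}")
```

Under these assumptions, P(DFS) scales roughly with the square of replicon length while the replicon is much shorter than Ns, which is why the few largest gaps between ROs dominate the genome-wide error probability.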
Fig. 4.
(A) Measured lengths of the largest replicons are shown in each dataset alongside the dashed bars showing the value obtained for virtual genomes of the same length and RO density but assuming ROs to be randomly distributed. (B) The distribution of genome-wide replicon lengths plotted in boxplot format for budding yeast, Drosophila, Arabidopsis, and the IMR90 cell line from two human datasets (B and P).
Fig. 5.
Data are from the IMR90 human datasets (A, C, E, and G) B and (B, D, F, and H) P. (A and B) Frequency of replicons in each cohort, defined according to the following size ranges: <10^3 bp, XS; 10^3 to 10^4 bp, S; 10^4 to 10^5 bp, M; 10^5 to 10^6 bp, L; and >10^6 bp, XL. (C and D) Probability of DFS in each cohort of the replicons. (E and F) Higher-resolution plot of probability of DFS at the transition from M to L gap cohorts contributing most toward the P(DFS); red bars show the bins with maximum P(DFS) in respective datasets. (G and H) Theoretical frequency distribution of replicons inferred from E and F is presented in blue; gray shows the actual frequency distribution in those bins in the data, and red highlights the red bins in E and F.
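The cohort definition used in panels A and B can be sketched as a simple binning of inter-origin distances by powers of ten (thresholds at 10^3, 10^4, 10^5, and 10^6 bp). The caption does not say how values exactly on a boundary are assigned, so the sketch assumes lower bounds are inclusive. The function names and the example origin positions are hypothetical.

```python
from collections import Counter

# Upper bounds (exclusive, by assumption) for each cohort, in base pairs.
COHORT_BOUNDS = [(1e3, "XS"), (1e4, "S"), (1e5, "M"), (1e6, "L")]
COHORT_LABELS = ["XS", "S", "M", "L", "XL"]


def cohort(length_bp: float) -> str:
    """Return the size cohort of a replicon of the given length."""
    for upper, label in COHORT_BOUNDS:
        if length_bp < upper:
            return label
    return "XL"


def replicon_lengths(origin_positions):
    """Distances between consecutive origins on one chromosome."""
    pos = sorted(origin_positions)
    return [b - a for a, b in zip(pos, pos[1:])]


def cohort_frequencies(origin_positions):
    """Fraction of replicons falling in each cohort."""
    counts = Counter(cohort(n) for n in replicon_lengths(origin_positions))
    total = sum(counts.values())
    if total == 0:
        return {label: 0.0 for label in COHORT_LABELS}
    return {label: counts[label] / total for label in COHORT_LABELS}


if __name__ == "__main__":
    # Made-up origin positions, for illustration only.
    print(cohort_frequencies([0, 500, 5_500, 55_500, 555_500, 5_555_500]))
```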
Fig. S2.
Data are from HeLa human datasets (A, C, E, and G) B and (B, D, F, and H) P. (A and B) Frequency of replicons in each cohort, defined according to the following size ranges: <10^3 bp, XS; 10^3 to 10^4 bp, S; 10^4 to 10^5 bp, M; 10^5 to 10^6 bp, L; and >10^6 bp, XL. (C and D) Probability of DFS in each cohort of the replicons. (E and F) Higher-resolution plot of probability of DFS at the transition from M to L gap cohorts contributing most toward the P(DFS); red bars show the bins with maximum P(DFS) in respective datasets. (G and H) Theoretical frequency distribution of replicons inferred from E and F is presented in blue; gray shows the actual frequency distribution in those bins in the data, and red highlights the red bins in E and F.
Fig. S3.
Data are from hESC and K562 in human datasets (A, C, E, and G) B and (B, D, F, and H) P. (A and B) Frequency of replicons in each cohort, defined according to the following size ranges: <10^3 bp, XS; 10^3 to 10^4 bp, S; 10^4 to 10^5 bp, M; 10^5 to 10^6 bp, L; and >10^6 bp, XL. (C and D) Probability of DFS in each cohort of the replicons. (E and F) Higher-resolution plot of probability of DFS at the transition from M to L gap cohorts contributing most toward the P(DFS); red bars show the bins with maximum P(DFS) in respective datasets. (G and H) Theoretical frequency distribution of replicons inferred from E and F is presented in blue; gray shows the actual frequency distribution in those bins in the data, and red highlights the red bins in E and F.
Fig. S4.
Data are from the iPSC human dataset in B. (A) Frequency of replicons in each cohort, defined according to the following size ranges: <10^3 bp, XS; 10^3 to 10^4 bp, S; 10^4 to 10^5 bp, M; 10^5 to 10^6 bp, L; and >10^6 bp, XL. (B) Probability of DFS in each cohort of the replicons. (C) Higher-resolution plot of probability of DFS at the transition from M to L gap cohorts contributing most toward the P(DFS); red bars show the bins with maximum P(DFS) in respective datasets. (D) Theoretical frequency distribution of replicons inferred from C is presented in blue; gray shows the actual frequency distribution in those bins in the data, and red highlights the red bins in C.
Fig. 6.
Theoretical prediction for the distribution of the number of DFSs based on the RO positions in each human cell-line dataset (using data from both B and P); also shown, as lines and dots, are best fits to a Poisson distribution.
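One way to see why a Poisson distribution fits the number of DFSs well is to treat each replicon as an independent trial with its own small failure probability. The sketch below is an illustration under that independence assumption, not the authors' procedure. It computes the exact distribution of the number of failures (a Poisson-binomial distribution) and compares it with a Poisson distribution whose mean is the sum of the per-replicon probabilities. The per-replicon probabilities are taken as given inputs, and the function names are hypothetical.

```python
import math


def dfs_count_distribution(probs, max_k: int = 10):
    """P(exactly k DFSs) for k = 0..max_k, given per-replicon probabilities.

    Dynamic programming over replicons; any mass above max_k is dropped.
    """
    dist = [1.0] + [0.0] * max_k
    for p in probs:
        # Update from high k to low k so dist[k - 1] is still the old value.
        for k in range(max_k, 0, -1):
            dist[k] = dist[k] * (1.0 - p) + dist[k - 1] * p
        dist[0] *= 1.0 - p
    return dist


def poisson_pmf(lam: float, k: int) -> float:
    """Poisson probability of k events with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


if __name__ == "__main__":
    # Made-up example: 1000 replicons, each failing with probability 0.001.
    probs = [0.001] * 1000
    exact = dfs_count_distribution(probs, max_k=4)
    lam = sum(probs)
    for k, e in enumerate(exact):
        print(k, f"exact={e:.4f}", f"poisson={poisson_pmf(lam, k):.4f}")
```

When every per-replicon probability is small, the two distributions are nearly indistinguishable, and the Poisson parameter λ can be read as the expected number of DFSs per cell.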
Fig. 7.
(A) Experimental distribution of three different replicates of 53BP1 nuclear bodies in the IMR90 cell line fitted with a naïve Poisson (i.e., taking the mean of the data as λ) (gray) and a filtered Poisson (i.e., ignoring the frequencies of zero counts to account for potential error from immunofluorescence staining) (light gray). The single fitting with the average of the three replicates (not statistically different) is shown. (B) Experimental distribution of 53BP1 nuclear bodies in the U2-OS cell line fitted with a naïve Poisson (i.e., taking the mean of the data as λ) (gray) and a filtered Poisson (i.e., ignoring the frequencies of zero counts to account for potential error from immunofluorescence staining) (light gray). (C) Experimental distribution of UFBs in the U2-OS cell line fitted with a naïve Poisson (gray) and a filtered Poisson (light gray). (D) Values of the Poisson parameter λ obtained from experimental fits of 53BP1 nuclear bodies in IMR90, U2-OS, and HeLa, and UFBs in U2-OS and HeLa, are compared with theoretical values obtained from different cell lines in Fig. 6.
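The two fits named in this caption can be sketched as follows. The naïve fit takes λ as the sample mean, as the caption states. For the filtered fit, the caption says only that zero counts are ignored. The sketch assumes this means a zero-truncated Poisson fitted by maximum likelihood, which may differ from the authors' exact procedure. The function names and the example counts are hypothetical.

```python
import math


def naive_lambda(counts) -> float:
    """Naive Poisson fit: lambda is the mean of all counts."""
    return sum(counts) / len(counts)


def filtered_lambda(counts, iterations: int = 200) -> float:
    """Zero-truncated Poisson fit (assumed meaning of 'filtered').

    Solves lam / (1 - exp(-lam)) = mean of the nonzero counts by bisection;
    this is the maximum-likelihood condition for a zero-truncated Poisson.
    """
    positives = [c for c in counts if c > 0]
    if not positives:
        raise ValueError("need at least one nonzero count")
    m = sum(positives) / len(positives)
    if m <= 1.0:
        return 0.0  # all nonzero counts equal 1: the estimate degenerates to 0
    lo, hi = 1e-12, m  # the truncated mean exceeds lam, so the root is below m
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mid / -math.expm1(-mid) < m:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Made-up counts of nuclear bodies per cell, for illustration only.
    counts = [0, 0, 0, 1, 1, 2, 3]
    print("naive lambda:   ", naive_lambda(counts))
    print("filtered lambda:", filtered_lambda(counts))
```

Comparing the two estimates indicates how strongly the zero class influences the fit. If cells with no detected foci are over-represented because of staining error, the naïve λ will be biased downward relative to the filtered one.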
Fig. 8.
(A) Based on the RO distributions in the various human datasets, theoretical predictions of the percentage of cells with DFSs are plotted as a function of the parameter Ns; the percentage is essentially 100% when Ns < 5 Mbp, and this percentage is still nontrivially high even when Ns > 20 Mbp. (B and C) Theoretical predictions of the probability of one, two, and three DFSs are shown as a function of Ns for the Besnard et al. (B) and Picard et al. (C) data. (D and E) Theoretical predictions of the probability of one, two, or three DFSs are shown as a function of Ns for the Besnard et al. (D) and Picard et al. (E) data. (F and G) Expected numbers of DFSs in different cell lines are plotted against Ns for the Besnard et al. (F) and Picard et al. (G) data; in black, blue, and red are the experimentally obtained expected number of 53BP1 nuclear bodies in IMR90, U2-OS, and HeLa, and UFBs in U2-OS and HeLa cell lines, respectively. Crossing points of the black, blue, and red lines over the curves provide an independent estimate for the plausible range of Ns (vertical lines) by directly comparing experimental data with theoretical predictions.
Fig. 9.
The difficulty of maintaining small DFS error rates as genome length increases: theoretical prediction of the average replicon length required to maintain a fixed probability of DFS as a function of genome length, for three different values of this probability. Diamonds show the positions of yeast, Arabidopsis, Drosophila, and human, obtained from the datasets of RO positions. The pink shadow highlights the biologically relevant range for mean replicon lengths as per all eukaryotic datasets available. The dashed red line marks the footprint for the MCM2-7 double hexamer, below which any replicon length is biologically unrealistic.

References

    1. Nielsen O, Løbner-Olesen A. Once in a lifetime: Strategies for preventing re-replication in prokaryotic and eukaryotic cells. EMBO Rep. 2008;9(2):151–156.
    2. Bebenek A. DNA replication fidelity. Postepy Biochem. 2008;54(1):43–56.
    3. Blow JJ, Ge XQ, Jackson DA. How dormant origins promote complete genome replication. Trends Biochem Sci. 2011;36(8):405–414.
    4. Sclafani RA, Holzen TM. Cell cycle regulation of DNA replication. Annu Rev Genet. 2007;41:237–280.
    5. Diffley JFX. Quality control in the initiation of eukaryotic DNA replication. Philos Trans R Soc Lond B Biol Sci. 2011;366(1584):3545–3553.
