Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2023 Feb 28;51(4):e22.
doi: 10.1093/nar/gkac1239.

OKseqHMM: a genome-wide replication fork directionality analysis toolkit

Affiliations

OKseqHMM: a genome-wide replication fork directionality analysis toolkit

Yaqun Liu et al. Nucleic Acids Res. .

Abstract

During each cell division, tens of thousands of DNA replication origins are co-ordinately activated to ensure the complete duplication of the human genome. However, replication fork progression can be challenged by many factors, including co-directional and head-on transcription-replication conflicts (TRC). Head-on TRCs are more dangerous for genome integrity. To study the direction of replication fork movement and TRCs, we developed a bioinformatics toolkit called OKseqHMM (https://github.com/CL-CHEN-Lab/OK-Seq, https://doi.org/10.5281/zenodo.7428883). Then, we used OKseqHMM to analyse a large number of datasets obtained by Okazaki fragment sequencing to directly measure the genome-wide replication fork directionality (RFD) and to accurately predict replication initiation and termination at a fine resolution in organisms including yeast, mouse and human. We also successfully applied our analysis to other genome-wide sequencing techniques that also contain RFD information (e.g. eSPAN, TrAEL-seq). Our toolkit can be used to predict replication initiation and fork progression direction genome-wide in a wide range of cell models and growth conditions. Comparing the replication and transcription directions allows identifying loci at risk of TRCs, particularly head-on TRCs, and investigating their role in genome instability by checking DNA damage data, which is of prime importance for human health.

PubMed Disclaimer

Figures

Figure 1.
Figure 1.
Schematic presentation of the data analysis pipeline in the OKseqHMM toolkit. Raw sequencing data are pre-processed into aligned files with the bioinformatics tools indicated in blue (Pre-processing panel). The middle panel shows the major functions of the OKseqHMM toolkit. The first function (OKseqHMM) checks the input aligned bam files to determine whether they are single- or paired-end sequencing data, and then automatically splits the reads into Watson and Crick strands and computes the replication fork directionality (RFD) (OKseqHMM panel). By default, the calculation is performed within 1 kb adjacent windows (recommended for human cells) and then smoothed into 15 kb sliding windows with 1 kb step. These parameters can be easily adjusted based on the nature of the data. Different replication features, i.e. initiation zones (IZ), two intermediate states and termination zones (TZ), are predicted using an HMM algorithm (see the graphic presentation of the four HMM states and their dynamic transition scheme). The second function (OKseqOEM) uses the reads on Watson and Crick strands to generate origin efficiency metrics (OEM) at multiple scales to visualize the RFD transition. The last function allows users to generate an average metagene profile and heatmap to analyse the RFD and OEM distributions around the genes/regions of interest. Results can be visualized in genomic visualization browsers (such as IGV), as shown in the right panel.
Figure 2.
Figure 2.
Schematic presentation of the HMM algorithm for initiation and termination zone detection. (A) The 4-state HMM model used in the segmentation process: Up, regions of predominant initiation (IZ); Down, regions of predominant termination (TZ); Flat1 and Flat2, two intermediate transition states. (B) Default state transition probability (between states) and emission probabilities (probabilities of each state within five quantiles of the formula image values) used in OKseqHMM (see Materials and Methods for detail). The probability matrixes were colour-coded based on their values (higher probability values are closer to red). (C and D) Examples of RFD profiles in chromosome 1 of human HeLa cells with the corresponding IZs, TZs and two Flat states identified by OKseqHMM. Each point on the RFD profile gives the RFD value calculated within each 1 kb adjacent window, and the windows with positive and negative RFD values are shown in red and blue, respectively.
Figure 3.
Figure 3.
Analysis of yeast OK-seq data by OKseqHMM. (A) RFD profile calculated at the 50 bp resolution with the corresponding IZs identified by OKseqHMM that are highly correlated with the confirmed yeast ARS from OriDB (29). The RFD profile is like in Figure 2C, but with a 50 bp resolution (instead of 1 kb). The lower part of the panel shows the OEM profiles calculated from the 50 bp to the 25 kb scale. The windows with positive and negative OEM values are shown in red and blue, respectively. (B) Length distribution of the detected OK-seq IZs. (C) Venn diagram showing the overlap of OK-seq IZs with all the known yeast origins (ARS) from OriDB clustered in three classes (confirmed, likely, and dubious). Overlap means that the closest distance between the centres of the IZ and ARS is <2 kb. Note that not all confirmed OriDB origins overlapped with OK-seq IZs because all origins might not be active in the culture condition and/or yeast strain used for the OK-seq experiment. Further comparison with origins identified in datasets obtained with other techniques can be found in Figure 7B. (D) Boxplot showing the distribution (in red) of distances between the centre of an IZ detected by OKseqHMM and the centre of the closest origin from OriDB (grouped in three classes). Such distances are much smaller than the distances between OriDB origins and the random simulation control (in black), as indicated by the Wilcoxon rank sum test; **P < 10−2, ***P < 10−3, ****P < 10−4.
Figure 4.
Figure 4.
Analysis of HeLa cell OK-seq data by OKseqHMM. (A) Replication timing profile obtained by Repli-seq, RFD profiles and the corresponding IZs detected in a publicly available HeLa MRL2 OK-seq dataset (6) and in OK-seq data of HeLa S3 cells generated in the current study, OEM profiles of HeLa S3 cells at the 1 kb to 1 Mb scales, and transcription data provided by GRO-seq and RNA-seq along a ∼4 Mb region on chromosome 1. (B) Venn diagrams showing that 67% of OK-seq IZs were shared by the two HeLa cell lines and up to 80% when only the early IZs were considered (i.e. replication timing S50 <0.4). (C) Mean OEM profiles and OEM heatmaps (the colour scale is indicated on the right) around the HeLa S3 IZ centres at the indicated scales (10, 20, 50 and 100 kb) sorted by the length of the detected IZs.
Figure 5.
Figure 5.
OKseqHMM reveals the coordination between DNA replication and gene transcription. (A) Mean profiles and heatmaps of transcription activity (RNA-seq and GRO-seq data) around the HeLa S3 OK-seq IZ centres. (B) Mean profile and heatmap of HeLa S3 RFD between the TSS and the TTS of active genes with an extension of ±50 kb. The heatmap colour scales are indicated in each panel.
Figure 6.
Figure 6.
Comparison of the genome-wide RFD profiles of different human cell lines shows the cell type-specific replication program. (A) Cell type-specific RFD profiles and the corresponding IZs detected in the indicated human cell lines. (B) Pairwise Pearson correlations between OK-seq RFD data (1 kb) for the indicated human cell lines.
Figure 7.
Figure 7.
Genome-wide RFD profiles obtained using TrAEL-seq and eSPAN datasets. (A) RFD profiles and the corresponding IZs (50 bp bin size) from yeast OK-seq and TrAEL-seq datasets (11). The known yeast origins (ARS) were downloaded from OriDB. Arrows indicate two IZs (chrIV:1447102–1448300, chrIV:1486452–1486950) identified in the TrAEL-seq dataset but not in the OK-seq dataset. (B) Venn diagram showing the overlap between OK-seq IZs (n = 348), TrAEL-seq IZs (n = 380), FORK-seq initiation events (n = 4964), and known origins (ARS) from OriDB (n = 829); overlap means that the closest distance between the centre of an IZ and of an ARS is <2 kb. When origins in one dataset overlapped with several origins in the other datasets, only one number was provided with the following priority order: OK-seq > TrAEL-seq > OriDB > FORK-seq. It should be noted that there are more origins unique to FORK-seq because this is a single-molecule technique that allows identifying also initiation events with very low frequency. (C) Metagene average RFD profiles computed using the OK-seq and eSPAN datasets from mouse embryonic stem cells (8). The mean and standard error bands are shown only for the eSPAN dataset because the standard error bands for the OK-seq dataset were too narrow to be seen.

References

    1. Gnan S., Liu Y., Spagnuolo M., Chen C.-L. The impact of transcription-mediated replication stress on genome instability and human disease. Genome Instab. Dis. 2020; 1:207–234.
    1. Hamperl S., Bocek M.J., Saldivar J.C., Swigut T., Cimprich K.A. Transcription-Replication conflict orientation modulates R-loop levels and activates distinct DNA damage responses. Cell. 2017; 170:774–786. - PMC - PubMed
    1. Merrikh H. Spatial and temporal control of evolution through replication–transcription conflicts. Trends Microbiol. 2017; 25:515–521. - PMC - PubMed
    1. Chen C.-L., Duquenne L., Audit B., Guilbaud G., Rappailles A., Baker A., Huvet M., D’Aubenton-Carafa Y., Hyrien O., Arneodo A. et al. . Replication-associated mutational asymmetry in the human genome. Mol. Biol. Evol. 2011; 28:2327–2337. - PubMed
    1. Huvet M., Nicolay S., Touchon M., Audit B., Aubenton-Carafa Y., Arneodo A., Thermes C. Human gene organization driven by the coordination of replication and transcription. Genome Res. 2007; 17:1278–1285. - PMC - PubMed

Publication types