Brief Bioinform. 2023 Nov 22;25(1):bbad432.
doi: 10.1093/bib/bbad432.

Unveiling human origins of replication using deep learning: accurate prediction and comprehensive analysis

Zhen-Ning Yin et al. Brief Bioinform.

Abstract

Accurate identification of replication origins (ORIs) is crucial for a comprehensive investigation into the progression of human cell growth and cancer therapy. Here, we propose a computational approach, Ori-FinderH, which can efficiently and precisely predict human ORIs of various lengths by combining the Z-curve method with a deep learning approach. Compared with existing methods, Ori-FinderH exhibits superior performance, achieving an area under the receiver operating characteristic curve (AUC) of 0.9616 for the K562 cell line in 10-fold cross-validation. In addition, we established a cross-cell-line predictive model, which yielded a further improved AUC of 0.9706. The model was subsequently employed as the fitness function of a genetic algorithm for generating artificial ORIs. Sequence analysis with iORI-Euk revealed that the vast majority of the generated sequences, 98% or more, contain at least one ORI for three cell lines (HeLa, MCF7 and K562). This innovative approach could provide more efficient, accurate and comprehensive information for experimental investigation, thereby further advancing the development of this field.
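
The Z-curve method mentioned in the abstract maps a DNA sequence to a curve in three dimensions. The following is a minimal sketch of the classic cumulative Z-curve transform, not the exact encoding used by Ori-FinderH, which the abstract does not specify; the function name `z_curve` is a hypothetical choice for illustration.

```python
def z_curve(seq):
    """Return the cumulative Z-curve points (x_n, y_n, z_n) for each prefix of seq.

    Classic definition, with A_n, C_n, G_n, T_n the base counts in the first n bases:
      x_n = (A_n + G_n) - (C_n + T_n)   purine vs pyrimidine
      y_n = (A_n + C_n) - (G_n + T_n)   amino vs keto
      z_n = (A_n + T_n) - (C_n + G_n)   weak vs strong hydrogen bonding
    """
    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    points = []
    for base in seq.upper():
        if base not in counts:
            raise ValueError(f"unexpected base: {base!r}")
        counts[base] += 1
        a, c, g, t = counts["A"], counts["C"], counts["G"], counts["T"]
        points.append(((a + g) - (c + t), (a + c) - (g + t), (a + t) - (c + g)))
    return points


print(z_curve("ACGT"))  # [(1, 1, 1), (0, 2, 0), (1, 1, -1), (0, 0, 0)]
```

A sequence with equal counts of all four bases returns to the origin, as the last point shows.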

Keywords: Z-curve method; deep learning; human genome; origin of replication.

Figures

Figure 1
The flowchart of Ori-FinderH. (A) Schematic diagram of the Ori-FinderH model. The model is primarily composed of three structural blocks, each containing a self-attention layer, a convolutional layer, an average pooling layer and an up-sampling layer. The binary output is obtained via an MLP. (B) Diagram of the ELU function. Because ELU takes negative values, the average activation is close to 0, which speeds up learning by bringing the gradient closer to the natural gradient.
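
The ELU behaviour described in panel (B) can be illustrated with a short sketch. It uses the standard ELU definition; the value `alpha=1.0` is an assumption, since the caption does not state the parameter used in Ori-FinderH, and `relu` is included only for comparison.

```python
import math


def elu(x, alpha=1.0):
    """Standard ELU: identity for x > 0, alpha * (exp(x) - 1) otherwise.

    alpha=1.0 is an assumed default, not a value taken from the paper.
    """
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def relu(x):
    """ReLU, shown only as a point of comparison."""
    return max(0.0, x)


inputs = [-2.0, -1.0, 0.0, 1.0, 2.0]
mean_elu = sum(elu(x) for x in inputs) / len(inputs)
mean_relu = sum(relu(x) for x in inputs) / len(inputs)

# On this symmetric input, ELU's negative outputs pull the mean toward zero.
print(f"mean ELU activation:  {mean_elu:.4f}")
print(f"mean ReLU activation: {mean_relu:.4f}")
```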
Figure 2
The flowchart of the genetic algorithm. Each dot denotes an individual DNA sequence.
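
A genetic algorithm of this kind can be sketched as follows. This is a generic illustration under stated assumptions, not the authors' procedure: `toy_fitness` (GC fraction) is a hypothetical stand-in for the trained model's score, and the selection scheme, crossover, mutation rate and population size are all illustrative choices. The all-A starting population mirrors the Init (A) setting described for Figure 7.

```python
import random

BASES = "ACGT"


def toy_fitness(seq):
    # Hypothetical stand-in for the trained predictor: fraction of G/C bases.
    return sum(base in "GC" for base in seq) / len(seq)


def mutate(seq, rate, rng):
    # Replace each base with a uniformly random base with probability `rate`.
    return "".join(rng.choice(BASES) if rng.random() < rate else b for b in seq)


def crossover(a, b, rng):
    # Single-point crossover between two equal-length parents.
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def evolve(fitness, length=50, pop_size=30, generations=40, rate=0.05, seed=0):
    rng = random.Random(seed)
    population = ["A" * length for _ in range(pop_size)]  # all-A initial sequences
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]  # truncation selection (assumed)
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rate, rng)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children  # parents survive, so the best never regresses
    return max(population, key=fitness)


best = evolve(toy_fitness)
print(best, round(toy_fitness(best), 2))
```

Swapping `toy_fitness` for a function that returns a model's predicted ORI probability would give the setup the abstract describes, with the predictor acting as the fitness function.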
Figure 3
Comprehensive performance of the present model. (A–C) Cross-cell-line performance comparison of models with and without attention for ACC (A), MCC (B) and AUC (C). (D) Comparison of different parameters and methods. (E–G) Cross-cell-line performance comparison between iORI-Epi and Ori-FinderH in HCT116 (E), K562 (F) and MCF7 (G). Note: * means P < 0.05, ** means P < 0.01, *** means P < 0.001.
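
For readers unfamiliar with the metrics in panels (A) and (B), the sketch below computes accuracy (ACC) and the Matthews correlation coefficient (MCC) from confusion-matrix counts using their standard definitions. The counts in the example are made up for illustration and are not results from the paper.

```python
import math


def accuracy(tp, tn, fp, fn):
    """ACC = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

    Returns 0.0 when the denominator is zero, a common convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Illustrative counts only, not taken from the paper.
print(f"ACC = {accuracy(40, 45, 5, 10):.4f}")
print(f"MCC = {mcc(40, 45, 5, 10):.4f}")
```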
Figure 4
Statistical analysis of GC contents and GC-rich motifs in ORIs. (A) Statistical diagram of GC content distribution for different sequence sets. L_i represents GC content between 10(i–1)% and 10i% (1 ≤ i ≤ 10). (B) The GC-rich motifs discovered in the ORIs from different cell lines.
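
The binning in panel (A) can be sketched as below. The caption does not say which bin receives a value that falls exactly on a boundary, so this sketch assumes bin i covers the interval (10(i-1)%, 10i%], with 0% placed in the first bin; the function names are hypothetical.

```python
from collections import Counter


def gc_bin(seq):
    """Return i (1..10) such that the GC content lies in (10(i-1)%, 10i%].

    Boundary handling is an assumption; 0% GC is assigned to bin 1.
    Integer arithmetic avoids floating-point rounding at bin edges.
    """
    seq = seq.upper()
    gc = sum(base in "GC" for base in seq)
    n = len(seq)
    return max(1, (10 * gc + n - 1) // n)  # ceil(10 * gc / n), floored at 1


def gc_histogram(seqs):
    """Count how many sequences fall into each bin 1..10."""
    counts = Counter(gc_bin(s) for s in seqs)
    return {i: counts.get(i, 0) for i in range(1, 11)}


print(gc_histogram(["AAAA", "ACGT", "GCGCAT", "GGGG"]))
```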
Figure 5
The cosine distance and dimensionality reduction analysis between different sequences after Z-curve encoding. (A) The cosine distance between different sequences after Z-curve encoding. (B) The result of dimensionality reduction analysis using LLE on the sequences after Z-curve encoding. (C, D) The result of dimensionality reduction analysis using t-SNE on the sequences after Z-curve encoding.
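
As a reminder of the measure used in panel (A), cosine distance is one minus the cosine of the angle between two vectors. The sketch below applies the standard definition to plain numeric vectors; how the Z-curve encodings are flattened into vectors before comparison is not stated in the caption, so the inputs here are arbitrary examples.

```python
import math


def cosine_distance(u, v):
    """Return 1 - (u . v) / (|u| |v|) for two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


print(cosine_distance([1, 0], [0, 1]))        # orthogonal vectors
print(cosine_distance([1, 2, 3], [2, 4, 6]))  # parallel vectors, close to 0
```

The distance ranges from 0 for vectors pointing the same way to 2 for vectors pointing in opposite directions, and it ignores vector magnitude.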
Figure 6
Performance in 10-fold cross-validation of models established for individual cell lines and of cross-cell-line models.
Figure 7
Analytical results of generated ORIs. (A) The GC content and fitness for generated ORIs from the initial sequences containing only A, which is abbreviated to Init (A). (B) The GC content and fitness for generated ORIs from the initial sequences containing only G, which is abbreviated to Init (G). (C) The test results obtained by using iORI-Euk to discriminate the generated ORIs.
