Accurate prediction of DNA N4-methylcytosine sites via boost-learning various types of sequence features
- PMID: 32917152
- PMCID: PMC7488740
- DOI: 10.1186/s12864-020-07033-8
Abstract
Background: DNA N4-methylcytosine (4mC) is a critical epigenetic modification that plays various roles in the restriction-modification system. Because experimental laboratory detection is costly, computational methods that combine sequence characteristics with machine learning algorithms have been explored to identify 4mC sites from DNA sequences. However, state-of-the-art methods have limited performance because they lack effective sequence features and choose their learning algorithms in an ad hoc manner. This paper proposes a new sequence feature space and a machine learning algorithm with a feature selection scheme to address the problem.
Results: The feature importance score distributions in datasets of six species are first reported and analyzed. The impact of feature selection on model performance is then evaluated by independent testing on benchmark datasets, where accuracy (ACC) and Matthews correlation coefficient (MCC) increase by 2.3% to 9.7% and 0.05 to 0.19, respectively, after feature selection. The proposed method is compared with three state-of-the-art predictors using independent testing and 10-fold cross-validation, and it outperforms them on all datasets, improving ACC by 3.02% to 7.89% and MCC by 0.06 to 0.15 in the independent test. In two detailed case studies, the proposed method correctly identified 24 of 26 4mC sites from the C. elegans gene and 126 of 137 4mC sites from the D. melanogaster gene, confirming its overall performance.
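As a reading aid for the ACC and MCC figures above, the following minimal Python sketch shows how the two measurements are conventionally computed from confusion-matrix counts. The function names and the example counts are hypothetical and are not taken from the paper; only the case-study counts (24 of 26 and 126 of 137) come from the abstract, and they are used here to derive the fraction of known 4mC sites recovered.

```python
import math


def acc_mcc(tp, tn, fp, fn):
    """Return (accuracy, Matthews correlation coefficient) from confusion counts."""
    total = tp + tn + fp + fn
    acc = (tp + tn) / total
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, mcc


def sensitivity(tp, fn):
    """Fraction of true sites that were correctly identified."""
    return tp / (tp + fn)


# Hypothetical counts, for illustration only (not results from the paper).
acc, mcc = acc_mcc(tp=80, tn=85, fp=15, fn=20)
print(f"ACC = {acc:.3f}, MCC = {mcc:.3f}")

# Case-study counts reported in the abstract.
print(f"C. elegans: {sensitivity(24, 26 - 24):.3f}")
print(f"D. melanogaster: {sensitivity(126, 137 - 126):.3f}")
```

ACC ranges from 0 to 1 and is usually quoted as a percentage, while MCC ranges from -1 to 1. This is why the abstract reports ACC gains in percent and MCC gains as plain differences.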
Conclusions: The results show that the proposed feature space and learning algorithm with feature selection can improve the performance of DNA 4mC prediction on the benchmark datasets. The two case studies prove the effectiveness of our method in practical situations.
Keywords: DNA N4-methylcytosine; Feature selection; Sequence feature; Site prediction.
Conflict of interest statement
The authors declare that they have no competing interests.