Bioinformatics. 2023 Aug 1;39(8):btad474.
doi: 10.1093/bioinformatics/btad474.

iCpG-Pos: an accurate computational approach for identification of CpG sites using positional features on single-cell whole genome sequence data

Sehi Park et al. Bioinformatics.

Abstract

Motivation: The investigation of DNA methylation can shed light on the processes underlying human well-being and help determine overall human health. However, insufficient coverage makes it challenging to implement single-stranded DNA methylation sequencing technologies, highlighting the need for an efficient prediction model. Models are required to create an understanding of the underlying biological systems and to project single-cell (methylated) data accurately.

Results: In this study, we developed positional features for predicting CpG sites. Positional characteristics of the sequence are derived using data from CpG regions and the separation between nearby CpG sites. Multiple optimized classifiers and different ensemble learning approaches are evaluated. The OPTUNA framework is used to optimize the algorithms. The CatBoost algorithm followed by the stacking algorithm outperformed existing DNA methylation identifiers.
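
The abstract does not spell out how the positional features are computed, so the following is only an illustrative sketch of the general idea of describing a CpG site by its separation from nearby CpG sites. The function names, the window size `k`, and the padding value `-1` are assumptions made for this example and are not taken from the paper.

```python
from bisect import bisect_left
from typing import List


def find_cpg_positions(sequence: str) -> List[int]:
    """Return the 0-based start index of every 'CG' dinucleotide."""
    seq = sequence.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


def neighbor_distances(positions: List[int], target: int,
                       k: int = 2, pad: int = -1) -> List[int]:
    """Distances from `target` to its k nearest upstream and k nearest
    downstream CpG sites (hypothetical feature layout).

    `positions` must be sorted in ascending order. Missing neighbors
    (near the sequence ends) are filled with `pad`.
    """
    idx = bisect_left(positions, target)
    # Closest upstream site first.
    upstream = positions[:idx][::-1][:k]
    # Skip the target itself if it is one of the listed sites.
    start = idx + 1 if idx < len(positions) and positions[idx] == target else idx
    downstream = positions[start:start + k]

    up = [target - p for p in upstream]
    down = [p - target for p in downstream]
    up += [pad] * (k - len(up))
    down += [pad] * (k - len(down))
    return up + down


if __name__ == "__main__":
    toy_sequence = "ACGTTCGACGGGCG"  # made-up sequence, not real data
    sites = find_cpg_positions(toy_sequence)
    print(sites)                         # [1, 5, 8, 12]
    print(neighbor_distances(sites, 5))  # [4, -1, 3, 7]
```

A feature vector of this kind could then be passed to any classifier; the choice of CatBoost and stacking reported above is the authors', while everything in this sketch is a simplification.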

Availability and implementation: The data and methodologies used in this study are openly accessible to the research community. Researchers can access the positional features and algorithms used for predicting CpG site methylation patterns. To achieve superior performance, we employed the CatBoost algorithm followed by the stacking algorithm, which outperformed existing DNA methylation identifiers. The proposed iCpG-Pos approach utilizes only positional features, resulting in a substantial reduction in computational complexity compared to other known approaches for detecting CpG site methylation patterns. In conclusion, our study introduces a novel approach, iCpG-Pos, for predicting CpG site methylation patterns. By focusing on positional features, our model offers both accuracy and efficiency, making it a promising tool for advancing DNA methylation research and its applications in human health and well-being.

Conflict of interest statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figures

Figure 1. Cell-wise methylated and non-methylated, training, and testing data size visualization of HCCs and HepG2 datasets.
Figure 2. Schematic flowchart of iCpG-Pos development. It comprises dataset creation, feature extraction, baseline model building, and stacking-predictor development.
Figure 3. Visual illustration of the extraction of positional features.
Figure 4. Evaluation of different feature extraction and classification techniques on the two adopted benchmark datasets, HCCs (a) and HepG2 (b). The box plot shows the average values and the variation over all cells in the dataset. Acronyms: Random Forest (RF), AdaBoost (AB), eXtreme Gradient Boosting (XGB), Sensitivity (Sn), Specificity (Sp), Accuracy (ACC), Matthews correlation coefficient (MCC), and Area Under the Receiver Operating Characteristic Curve (AUC).
Figure 5. t-distributed stochastic neighbor embedding (t-SNE) positional feature distribution of CpG and Non-CpG sites.
Figure 6. Distribution of evaluation metrics on the HCCs (a) and HepG2 (b) datasets. “L” stands for our LightCpG, “Z” stands for Zhang’s method, “PC” represents the proposed algorithm with CatBoost, and “PS” represents the proposed algorithm with the stacking framework.
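
The threshold-based metrics named in the Figure 4 caption (Sn, Sp, ACC, MCC) follow standard definitions over a binary confusion matrix. The sketch below shows those textbook formulas; the function name and the convention of returning 0.0 when a denominator is zero are assumptions for this example, and the counts used are invented, not results from the paper. AUC is omitted because it requires ranked prediction scores rather than counts.

```python
import math
from typing import Dict


def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute Sn, Sp, ACC and MCC from confusion-matrix counts.

    Any ratio with a zero denominator is reported as 0.0
    (a convention assumed here, not prescribed by the source).
    """
    def safe_div(num: float, den: float) -> float:
        return num / den if den else 0.0

    total = tp + tn + fp + fn
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Sn": safe_div(tp, tp + fn),    # sensitivity (recall on positives)
        "Sp": safe_div(tn, tn + fp),    # specificity (recall on negatives)
        "ACC": safe_div(tp + tn, total),
        "MCC": safe_div(tp * tn - fp * fn, mcc_den),
    }


if __name__ == "__main__":
    # Invented counts, for illustration only.
    for name, value in binary_metrics(tp=40, tn=45, fp=5, fn=10).items():
        print(f"{name}: {value:.3f}")
```

MCC uses all four cells of the confusion matrix, which is why it is often reported alongside accuracy for datasets where the methylated and non-methylated classes are imbalanced.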
