BMC Genomics. 2019 Apr 23;20(1):306.
doi: 10.1186/s12864-019-5654-9.

LightCpG: a multi-view CpG sites detection on single-cell whole genome sequence data

Limin Jiang et al. BMC Genomics. 2019.

Erratum in

Abstract

Background: DNA methylation plays an important role in multiple biological processes that are closely related to human health. Studying DNA methylation can provide insight into the mechanisms behind human health and can also benefit the assessment of human health status. However, available sequencing technology is limited by incomplete CpG coverage. It is therefore crucial to find an efficient and convenient method capable of distinguishing between the methylation states of CpG sites. Previous studies on identifying the methylation states of CpG sites in single cells evaluated only sequence information or structural information.

Results: In this paper, we propose a novel model, LightCpG, which combines positional features with sequence and structural features to provide information on CpG sites at two stages. We then used the LightGBM model to train the CpG site identification, and further used sample extraction and merged features to reduce the training time. Our results indicate that our method achieves outstanding performance in recognizing DNA methylation states. The average AUC values of our method on the 25 human hepatocellular carcinoma (HCC) cell datasets and the six human hepatoblastoma-derived (HepG2) cell datasets were 0.9616 and 0.9213, respectively. Moreover, the average training times of our method on the HCC and HepG2 datasets were 8.3 and 5.06 s, respectively. Furthermore, the computational complexity of our model was much lower than that of other available methods that detect the methylation states of CpG sites.
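
The AUC values reported above are per-cell scores averaged over the cells of each dataset. As a rough illustration only, the Python sketch below computes a rank-based AUC for one cell and then averages it across cells. The function names `auc_score` and `mean_auc` and the toy labels and scores are hypothetical and are not taken from the paper or its code.

```python
def auc_score(labels, scores):
    """Rank-based AUC (Mann-Whitney U); tied scores share their average rank."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied scores starting at position i.
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied run
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one methylated and one unmethylated site")
    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def mean_auc(per_cell):
    """Average the AUC over cells; per_cell is a list of (labels, scores) pairs."""
    return sum(auc_score(y, s) for y, s in per_cell) / len(per_cell)


# Toy data: 1 = methylated, 0 = unmethylated (made-up values).
cells = [([0, 1], [0.2, 0.9]), ([0, 1], [0.5, 0.5])]
average = mean_auc(cells)
```

Averaging per-cell AUCs, rather than pooling all sites, matches the "average AUC" wording of the abstract but is still an assumption about how the reported numbers were aggregated.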

Conclusions: In summary, LightCpG is an accurate model for identifying the DNA methylation status of CpG sites in single cells. Furthermore, the three types of feature extraction methods and the two strategies used in LightCpG are helpful for other prediction problems.

Keywords: DNA methylation; LightGBM; Positional features; Sequence features; Structural features.

Conflict of interest statement

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
The flow chart of LightCpG. CpG profiles are obtained from scTrio-seq. The dataset includes multiple single-cell CpG profiles. Feature extraction: the positional features include the methylation state and the distance between sites; the structural features include CpG island (CGI) status (CGI, CGI shore, CGI shelf), cis-regulatory elements (TFBS, DNase, chromatin states, histone modification), and DNA properties (integrated haplotype score (iHS), constraint score); the sequence features are 84-dimensional and are extracted from the DNA sequence with the n-gram method. Training: LightGBM is used to construct a model for each single-cell CpG dataset; sample selection is used to reduce the number of samples; feature merging is used to reduce the number of features. Testing: the trained LightCpG model can be used to predict new CpG sites.
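
The n-gram sequence features mentioned in the caption can be sketched as follows. One plausible reading of the 84 dimensions is 4 + 16 + 64, the number of possible 1-, 2- and 3-mers over A, C, G and T; the caption does not spell this out, so that decomposition, the frequency normalization, and the names `NGRAMS` and `ngram_features` are all assumptions made for illustration.

```python
from itertools import product

ALPHABET = "ACGT"
# Fixed feature order: all 1-mers, then 2-mers, then 3-mers (4 + 16 + 64 = 84).
# Assumption: this is how the 84 dimensions arise; the source does not say so.
NGRAMS = ["".join(p) for n in (1, 2, 3) for p in product(ALPHABET, repeat=n)]
INDEX = {gram: i for i, gram in enumerate(NGRAMS)}


def ngram_features(window):
    """Return the relative frequency of every 1-, 2- and 3-mer in a DNA window."""
    window = window.upper()
    feats = [0.0] * len(NGRAMS)
    for n in (1, 2, 3):
        total = len(window) - n + 1  # number of n-grams in the window
        if total <= 0:
            continue
        for i in range(total):
            idx = INDEX.get(window[i:i + n])
            if idx is not None:  # skip n-grams containing N or other symbols
                feats[idx] += 1.0 / total
    return feats


# Hypothetical 5-bp window around a CpG site.
features = ngram_features("ACGCG")
```

Here `features[INDEX["CG"]]` is 0.5, because two of the four 2-mers in `ACGCG` are `CG`.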
Fig. 2
The sketch map of the skip-k method
Fig. 3
The sketch map of feature F and feature D
Fig. 4
The flow chart of GOSS and EFB. (1) GOSS was used to reduce the number of samples. First, all data samples were sorted according to their gradient values. Then, the top a% of samples were extracted and b% of the remaining samples were randomly selected. (2) EFB was used to reduce the number of features. First, multiple sparse features were bundled into one set; each set was then combined into a single feature with the help of a histogram.
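
Step (1) of the caption can be sketched in a few lines of Python. This is a minimal illustration, not LightGBM's implementation: the function name `goss_sample` is hypothetical, sorting by absolute gradient is an assumption (the caption only says "gradient values"), and the up-weighting of the randomly selected samples is not mentioned in the caption and is added here as an assumption so that the sampled small-gradient rows stand in for the ones that were dropped.

```python
import random


def goss_sample(gradients, a=0.2, b=0.1, seed=0):
    """Return {sample index: weight} for a GOSS-style subsample.

    a: fraction of samples with the largest gradients that is always kept.
    b: fraction of the REMAINING samples that is drawn at random
       (this follows the caption's wording).
    """
    rng = random.Random(seed)
    n = len(gradients)
    # Assumption: rank samples by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_n = int(round(a * n))
    top, rest = order[:top_n], order[top_n:]
    rand_n = int(round(b * len(rest)))
    sampled = rng.sample(rest, rand_n)
    weights = {i: 1.0 for i in top}
    if rand_n > 0:
        # Assumption: scale sampled rows so they represent all of `rest`.
        scale = len(rest) / rand_n
        for i in sampled:
            weights[i] = scale
    return weights


# Toy gradients for 100 samples (made-up values).
grads = [i / 100 for i in range(100)]
subset = goss_sample(grads, a=0.2, b=0.1)
```

With these toy values the subsample keeps the 20 largest-gradient samples at weight 1 and 8 of the other 80 at weight 10, so 28 rows carry the total weight of the original 100.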
Fig. 5
The performance of different k1 values on the GM12878 dataset
Fig. 6
The performance of different k1 values on the heart left ventricle dataset
Fig. 7
The performance of different k2 values on the HCCs dataset
Fig. 8
The importance score for all features on the HCCs dataset
Fig. 9
The trend of accuracy with feature dimension for all cells in the HCCs dataset
Fig. 10
The AUCs of different feature extraction methods on the HCCs dataset. RF Ours uses our features to train the RF model. RF Deep uses DeepCpG features to train the RF model. RF Zhang uses Zhang’s features to train the RF model
Fig. 11
The AUCs of different feature extraction methods on the HepG2 dataset. RF Ours uses our features to train the RF model. RF Deep uses DeepCpG features to train the RF model. RF Zhang uses Zhang’s features to train the RF model
Fig. 12
The distribution of the evaluation values using the HCCs dataset. O represents our method, D represents the DeepCpG method, and Z represents the method of RF Zhang
Fig. 13
The distribution of the evaluation values using the HepG2 dataset. O represents our method, D represents the DeepCpG method, and Z represents the method of RF Zhang

References

    1. Zhang W, Spector TD, Deloukas P, et al. Predicting genome-wide DNA methylation using methylation marks, genomic position, and DNA regulatory elements. Genome Biol. 2015;16(1):1–20.
    2. Suzuki MM, Adrian B. DNA methylation landscapes: provocative insights from epigenomics. Nat Rev Genet. 2008;9(6):465.
    3. Bianchi C, Zangi R. Molecular dynamics study of the recognition of dimethylated CpG sites by MBD1 protein. J Chem Inf Model. 2015;55(3):636.
    4. Gao D, Zhu B, Sun H. In: Mitochondrial DNA Methylation and Related Disease. Singapore: Springer Singapore; 2017. p. 117–32.
    5. Wan J, Oliver VF, Wang G, et al. Characterization of tissue-specific differential DNA methylation suggests distinct modes of positive and negative gene expression regulation. BMC Genomics. 2015;16(1):49.