BMC Bioinformatics. 2014 Jun 17;15:193.
doi: 10.1186/1471-2105-15-193.

Phenotype prediction based on genome-wide DNA methylation data

Thomas Wilhelm. BMC Bioinformatics. 2014.

Abstract

Background: DNA methylation (DNAm) has important regulatory roles in many biological processes and diseases. It is the only epigenetic mark with a clear mechanism of mitotic inheritance and the only one easily available on a genome scale. Aberrant cytosine-phosphate-guanine (CpG) methylation has been discussed in the context of disease aetiology, especially cancer. CpG hypermethylation of promoter regions is often associated with silencing of tumour suppressor genes, and hypomethylation with activation of oncogenes. Supervised principal component analysis (SPCA) is a popular machine learning method. However, in a recent application to phenotype prediction from DNAm data, SPCA was inferior to the specific method EVORA.

Results: We present Model-Selection-SPCA (MS-SPCA), an enhanced version of SPCA. MS-SPCA applies several models that perform well in the training data to the test data and selects the very best models for final prediction based on parameters of the test data. We have applied MS-SPCA for phenotype prediction from genome-wide DNAm data. CpGs used for prediction are selected based on the quantification of three features of their methylation (average methylation difference, methylation variation difference and methylation-age-correlation). We analysed four independent case-control datasets that correspond to different stages of cervical cancer: (i) cases that are currently cytologically normal but will later develop neoplastic transformations, (ii, iii) cases showing neoplastic transformations and (iv) cases with confirmed cancer. The first dataset was split into several smaller case-control datasets (samples either Human Papilloma Virus (HPV) positive or negative). We demonstrate that cytologically normal HPV+ and HPV- samples contain DNAm patterns which are associated with later neoplastic transformations. We present evidence that DNAm patterns exist in cytologically normal HPV- samples that (i) predispose to neoplastic transformations after HPV infection and (ii) predispose to HPV infection itself. MS-SPCA performs significantly better than EVORA.

Conclusions: MS-SPCA can be applied to many classification problems. Additional improvements could include usage of more than one principal component (PC), with automatic selection of the optimal number of PCs. We expect that MS-SPCA will be useful for analysing recent larger DNAm data to predict future neoplastic transformations.

Figures

Figure 1
Two parameters used for final model selection. Each dot corresponds to one model that performs well in cross-validation in the training data. Each row corresponds to a given training dataset (name on the left), each column to the corresponding test dataset (name in header). For instance, the field row 1 (Normal) – column 4 (CIN2+(a)) shows the two parameters (x-axis Eval1, y-axis EV1dist) for all >300 models selected from the training dataset Normal (LOO-prediction-accuracy > 0.65), when applied to the test data CIN2+(a). For better visualization, the 10% of the models predicting the test data best are shown in red, the next 10% (between deciles 1 and 2) are coloured green and the next (between deciles 2 and 3) blue. Black dots represent the remaining 70%. Eval1 is the normalized largest eigenvalue of the covariance matrix taken from the methylation matrix of the test data. EV1dist is the Euclidean distance between the leading eigenvectors of the model’s covariance matrix in the training data and in the test data.
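The two parameters defined in this caption can be illustrated with a minimal Python sketch. It is not the author's implementation. It assumes that "normalized" means the largest eigenvalue divided by the sum of all eigenvalues, and that the sign ambiguity of eigenvectors is resolved by flipping the test eigenvector so that it points in the same direction as the training one. The function names `leading_eigenpair`, `eval1` and `ev1dist` are hypothetical, and the input matrices are taken to be samples × CpGs, restricted to the CpGs of one model.

```python
import numpy as np

def leading_eigenpair(X):
    """X: samples x CpGs methylation matrix.
    Returns (largest eigenvalue / sum of eigenvalues, unit leading eigenvector).
    The normalization by the eigenvalue sum is an assumption."""
    cov = np.cov(X, rowvar=False)       # CpG x CpG covariance matrix
    evals, evecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    return evals[-1] / evals.sum(), evecs[:, -1]

def eval1(X_test):
    """Normalized largest eigenvalue of the test-data covariance matrix."""
    return float(leading_eigenpair(X_test)[0])

def ev1dist(X_train, X_test):
    """Euclidean distance between the leading eigenvectors of the
    training and test covariance matrices (same CpGs, same order)."""
    v_train = leading_eigenpair(X_train)[1]
    v_test = leading_eigenpair(X_test)[1]
    # Eigenvectors are defined only up to sign; align them (assumption).
    if np.dot(v_train, v_test) < 0:
        v_test = -v_test
    return float(np.linalg.norm(v_train - v_test))
```

Under these assumptions Eval1 lies in (0, 1] and EV1dist in [0, √2], so a model whose first principal component dominates the test data and matches its training direction gets a high Eval1 and a low EV1dist.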
Figure 2
Performance of prediction (AUC). Each row corresponds to a given training dataset, each column to a test dataset and each dot to one model. Models are ordered according to Eval1-EV1dist, rank 1 corresponds to the model with the largest value. Eval1 is the normalized largest eigenvalue of the covariance matrix taken from the methylation matrix of the test data. EV1dist is the Euclidean distance between the leading Eigenvectors of the model’s covariance matrix in the training data and in the test data. The red line shows the AUC resulting from cumulative risk scores (see Methods). The values of the red lines at model rank 5 are given in Table 6.
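The ordering described in this caption can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not define the cumulative risk score (it refers to the Methods), so the sketch assumes it is the mean risk score over the top-ranked models up to a given rank. The names `rank_models`, `auc` and `cumulative_auc` are hypothetical, and `risk` is taken to be a models × samples matrix of per-model risk scores.

```python
import numpy as np

def rank_models(eval1, ev1dist):
    """Model indices ordered by decreasing Eval1 - EV1dist (rank 1 first)."""
    score = np.asarray(eval1, dtype=float) - np.asarray(ev1dist, dtype=float)
    return np.argsort(-score, kind="stable")

def auc(labels, scores):
    """AUC as the probability that a case scores higher than a control
    (ties count one half)."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    cases, controls = scores[labels], scores[~labels]
    diff = cases[:, None] - controls[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def cumulative_auc(risk, labels, order):
    """AUC of the mean risk score over the top-k models, for k = 1..n.
    Averaging over models is an assumption, not the paper's stated rule."""
    ranked = np.asarray(risk, dtype=float)[order]
    counts = np.arange(1, len(order) + 1)[:, None]
    cum_mean = np.cumsum(ranked, axis=0) / counts
    return [auc(labels, row) for row in cum_mean]
```

With this layout, the value that the caption reads off at model rank 5 would correspond to `cumulative_auc(risk, labels, order)[4]`.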
Figure 3
Description of models used for predictions (weights and # CpGs). Each row corresponds to a given training dataset, each column to a test dataset. Models are ordered according to Eval1-EV1dist, rank 1 corresponds to the model with the largest value. Eval1 is the normalized largest eigenvalue of the covariance matrix taken from the methylation matrix of the test data. EV1dist is the Euclidean distance between the leading Eigenvectors of the model’s covariance matrix in the training data and in the test data. The black line shows the mean number of CpGs used in the models up to the indicated rank, normalized by the maximum number of CpGs considered (1500). The other lines correspond to the mean weights (see Methods) used in the models up to the indicated rank. Blue lines correspond to average methylation difference (t- or MWU test), red to methylation variation difference (Bartlett’s or Levene’s test) and green to methylation-age-correlation. Solid lines indicate models taking into account both hyper- and hypomethylated CpGs. Dashed lines represent models using only hypermethylated and dotted lines indicate models using only hypomethylated CpGs.
