Phenotype prediction based on genome-wide DNA methylation data

Thomas Wilhelm¹

PMID: 24934728
PMCID: PMC4073816
DOI: 10.1186/1471-2105-15-193

Thomas Wilhelm. BMC Bioinformatics. 2014 Jun 17;15:193. doi: 10.1186/1471-2105-15-193.

Affiliation

¹ Theoretical Systems Biology, Institute of Food Research, Norwich Research Park, Norwich NR4 7UA, UK. Thomas.wilhelm@ifr.ac.uk.

Abstract

Background: DNA methylation (DNAm) has important regulatory roles in many biological processes and diseases. It is the only epigenetic mark with a clear mechanism of mitotic inheritance and the only one easily available on a genome scale. Aberrant cytosine-phosphate-guanine (CpG) methylation has been discussed in the context of disease aetiology, especially cancer. CpG hypermethylation of promoter regions is often associated with silencing of tumour suppressor genes and hypomethylation with activation of oncogenes. Supervised principal component analysis (SPCA) is a popular machine learning method. However, in a recent application to phenotype prediction from DNAm data, SPCA was inferior to the specific method EVORA.

Results: We present Model-Selection-SPCA (MS-SPCA), an enhanced version of SPCA. MS-SPCA applies several models that perform well in the training data to the test data and selects the very best models for final prediction based on parameters of the test data. We have applied MS-SPCA to phenotype prediction from genome-wide DNAm data. CpGs used for prediction are selected by quantifying three features of their methylation (average methylation difference, methylation variation difference and methylation-age-correlation). We analysed four independent case-control datasets that correspond to different stages of cervical cancer: (i) cases that are currently cytologically normal but will later develop neoplastic transformations, (ii, iii) cases showing neoplastic transformations and (iv) cases with confirmed cancer. The first dataset was split into several smaller case-control datasets (samples either human papillomavirus (HPV) positive or negative). We demonstrate that cytologically normal HPV+ and HPV- samples contain DNAm patterns that are associated with later neoplastic transformations. We present evidence that DNAm patterns exist in cytologically normal HPV- samples that (i) predispose to neoplastic transformations after HPV infection and (ii) predispose to HPV infection itself. MS-SPCA performs significantly better than EVORA.
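As a rough illustration of the supervised-PCA idea underlying MS-SPCA, the Python sketch below selects CpGs by a single feature (average methylation difference, scored here with a Welch t-statistic) and uses the first principal component of the selected CpGs as a risk score. The function names, the choice of Welch's t and the sign-orientation step are assumptions made for this example and are not taken from the paper. The variation and age-correlation features and the model-selection step are omitted.

```python
import numpy as np

def welch_t(cases, controls):
    """Per-CpG Welch t-statistic; rows are samples, columns are CpGs."""
    m1, m0 = cases.mean(axis=0), controls.mean(axis=0)
    v1, v0 = cases.var(axis=0, ddof=1), controls.var(axis=0, ddof=1)
    return (m1 - m0) / np.sqrt(v1 / len(cases) + v0 / len(controls))

def fit_spca(X, y, n_cpgs):
    """Pick the n_cpgs CpGs with the largest |t| and fit the first PC on them.

    X: (n_samples, n_cpgs_total) methylation matrix, y: 0/1 labels (1 = case).
    Returns (selected column indices, training means, PC1 direction).
    """
    t = welch_t(X[y == 1], X[y == 0])
    idx = np.argsort(-np.abs(t))[:n_cpgs]
    mean = X[:, idx].mean(axis=0)
    Xc = X[:, idx] - mean
    # The leading right singular vector of the centred data is the PC1 direction.
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    pc1 = vt[0]
    # Assumed convention: orient PC1 so that cases score higher on average.
    scores = Xc @ pc1
    if scores[y == 1].mean() < scores[y == 0].mean():
        pc1 = -pc1
    return idx, mean, pc1

def risk_score(X_new, model):
    """Project new samples onto the PC1 direction learned in training."""
    idx, mean, pc1 = model
    return (X_new[:, idx] - mean) @ pc1
```

A model fitted with `fit_spca(X_train, y_train, n_cpgs=100)` could then be applied to an independent dataset with `risk_score(X_test, model)`, provided both matrices share the same CpG columns.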

Conclusions: MS-SPCA can be applied to many classification problems. Further improvements could include using more than one principal component (PC), with automatic selection of the optimal number of PCs. We expect MS-SPCA to be useful for analysing recent, larger DNAm datasets to predict future neoplastic transformations.

Figures

**Figure 1**
**Two parameters used for final model selection.** Each dot corresponds to one model that performs well in cross-validation in the training data. Each row corresponds to a given training dataset (name on the left), each column to the corresponding test dataset (name in header). For instance, the panel in row 1 (Normal), column 4 (CIN2+(a)) shows the two parameters (x-axis *Eval1*, y-axis *EV1dist*) for all >300 models selected from the training dataset Normal (LOO-prediction-accuracy > 0.65), when applied to the test data CIN2+(a). For better visualization, the 10% of the models that predict the test data best are shown in red, the next 10% (between deciles 1 and 2) in green and the next 10% (between deciles 2 and 3) in blue. Black dots represent the remaining 70%. *Eval1* is the normalized largest eigenvalue of the covariance matrix taken from the methylation matrix of the test data. *EV1dist* is the Euclidean distance between the leading eigenvectors of the model’s covariance matrix in the training data and in the test data.
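The two parameters defined in this caption can be sketched in Python as follows. This is a minimal example under stated assumptions: "normalized" is taken to mean division by the sum of all eigenvalues, and the test eigenvector is sign-aligned with the training eigenvector before the distance is computed, because an eigenvector is only defined up to sign. Neither detail is stated in the caption, and the function names are hypothetical.

```python
import numpy as np

def leading_eig(X):
    """Largest eigenvalue (as a fraction of the eigenvalue sum) and its unit
    eigenvector for the CpG-by-CpG covariance matrix of X.

    X: (n_samples, n_cpgs) methylation matrix with at least two CpG columns.
    """
    cov = np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    return vals[-1] / vals.sum(), vecs[:, -1]

def eval1_ev1dist(X_train, X_test):
    """Return (Eval1, EV1dist) for one model.

    Both matrices must contain the same CpGs (the model's CpGs) in the
    same column order.
    """
    _, v_train = leading_eig(X_train)
    eval1, v_test = leading_eig(X_test)
    # Assumed convention: flip the test eigenvector if it points the other way.
    if v_train @ v_test < 0:
        v_test = -v_test
    return eval1, np.linalg.norm(v_train - v_test)
```

With this sign alignment, the distance between two unit vectors lies between 0 (identical directions) and the square root of 2 (orthogonal directions).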

**Figure 2**
**Performance of prediction (AUC).** Each row corresponds to a given training dataset, each column to a test dataset and each dot to one model. Models are ordered according to *Eval1*-*EV1dist*; rank 1 corresponds to the model with the largest value. *Eval1* is the normalized largest eigenvalue of the covariance matrix taken from the methylation matrix of the test data. *EV1dist* is the Euclidean distance between the leading eigenvectors of the model’s covariance matrix in the training data and in the test data. The red line shows the AUC resulting from cumulative risk scores (see Methods). The values of the red lines at model rank 5 are given in Table 6.
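The ranking and the cumulative curve described in this caption might be computed along the following lines. The sketch reads *Eval1*-*EV1dist* as a numerical difference and assumes that "cumulative risk scores" means summing the standardized per-model scores of all models up to a given rank. The Methods section is not reproduced here, so both readings, and all function names, are assumptions of this example.

```python
import numpy as np

def auc(scores, labels):
    """AUC as the probability that a case outscores a control (ties count 1/2)."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

def cumulative_auc(model_scores, eval1, ev1dist, labels):
    """AUC of the cumulative risk score at each model rank.

    model_scores: (n_models, n_samples) risk scores on the test data.
    eval1, ev1dist: (n_models,) selection parameters, one pair per model.
    Each model's scores must vary across samples (non-zero standard deviation).
    """
    order = np.argsort(-(eval1 - ev1dist))  # rank 1 = largest Eval1 - EV1dist
    ranked = model_scores[order]
    # Standardize each model's scores so that no single model dominates the sum.
    z = (ranked - ranked.mean(axis=1, keepdims=True)) / ranked.std(axis=1, keepdims=True)
    cum = np.cumsum(z, axis=0)  # row k holds the summed scores of ranks 1..k+1
    return order, np.array([auc(row, labels) for row in cum])
```

Note that the true labels are used only to evaluate the AUC. The ranking itself depends on `eval1` and `ev1dist` alone, which is consistent with model selection being based on parameters of the test data rather than on its labels.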

**Figure 3**
**Description of models used for predictions (weights and number of CpGs).** Each row corresponds to a given training dataset, each column to a test dataset. Models are ordered according to *Eval1*-*EV1dist*; rank 1 corresponds to the model with the largest value. *Eval1* is the normalized largest eigenvalue of the covariance matrix taken from the methylation matrix of the test data. *EV1dist* is the Euclidean distance between the leading eigenvectors of the model’s covariance matrix in the training data and in the test data. The black line shows the mean number of CpGs used in the models up to the indicated rank, normalized by the maximum number of CpGs considered (1500). The other lines correspond to the mean weights (see Methods) used in the models up to the indicated rank. Blue lines correspond to average methylation difference (t-test or Mann-Whitney U test), red to methylation variation difference (Bartlett’s or Levene’s test) and green to methylation-age-correlation. Solid lines indicate models that take into account both hyper- and hypomethylated CpGs. Dashed lines represent models using only hypermethylated CpGs and dotted lines models using only hypomethylated CpGs.

Grants and funding

BB/J004529/1/Biotechnology and Biological Sciences Research Council/United Kingdom
