Risk prediction using genome-wide association studies

Charles Kooperberg¹, Michael LeBlanc, Valerie Obenchain

PMID: 20842684
PMCID: PMC2964405
DOI: 10.1002/gepi.20509

Genet Epidemiol. 2010 Nov;34(7):643-52. doi: 10.1002/gepi.20509.

Affiliation

¹ Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington 98109-1024, USA. clk@fhcrc.org

Abstract

Over the last few years, many new genetic associations have been identified by genome-wide association studies (GWAS). There are potentially many uses of these identified variants: a better understanding of disease etiology, personalized medicine, new leads for studying underlying biology, and risk prediction. Recently, there has been some skepticism regarding the prospects of risk prediction using GWAS, primarily motivated by the fact that individual effect sizes of variants associated with the phenotype are mostly small. However, there have also been arguments that many disease-associated variants have not yet been identified; hence, prospects for risk prediction may improve if more variants are included. From a risk prediction perspective, it is reasonable to average a larger number of predictors, of which some may have (limited) predictive power and some may actually be noise. The idea is that, when added together, the combined small signals yield a signal that is stronger than the noise from the unrelated predictors. We examine various aspects of the construction of models for the estimation of disease probability. We compare different methods for constructing such models, examine how the implementation of cross-validation may influence results, and examine which single nucleotide polymorphisms (SNPs) are most useful for prediction. We carry out our investigation on GWAS of the Wellcome Trust Case Control Consortium (WTCCC). For Crohn's disease, we confirm our results on another GWAS. Our results suggest that utilizing a larger number of SNPs than those that reach genome-wide significance, for example using the lasso, improves the construction of risk prediction models.
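
The averaging argument in the abstract can be illustrated with a toy simulation. The sketch below is not the authors' analysis: the sample size, the number of signal and noise SNPs, and the allele frequencies are all hypothetical values chosen for illustration, and the score is a plain unweighted allele count rather than a fitted lasso model. It compares the AUC of one weakly associated SNP with the AUC of a sum over 50 weak SNPs mixed with 50 unrelated ones.

```python
import random

def auc(scores, labels):
    """Chance that a random case outscores a random control (ties count 1/2)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for c in cases:
        for d in controls:
            wins += 1.0 if c > d else 0.5 if c == d else 0.0
    return wins / (len(cases) * len(controls))

def genotype(freq, rng):
    """Draw a 0/1/2 risk-allele count for one SNP with allele frequency `freq`."""
    return (rng.random() < freq) + (rng.random() < freq)

rng = random.Random(0)
N_PEOPLE, N_SIGNAL, N_NOISE = 1000, 50, 50   # hypothetical sizes
FREQ_CONTROL, FREQ_CASE = 0.30, 0.35         # hypothetical weak per-SNP effect

labels = [i % 2 for i in range(N_PEOPLE)]    # alternate controls (0) and cases (1)
rows = []
for y in labels:
    freq = FREQ_CASE if y == 1 else FREQ_CONTROL
    signal = [genotype(freq, rng) for _ in range(N_SIGNAL)]        # weakly associated SNPs
    noise = [genotype(FREQ_CONTROL, rng) for _ in range(N_NOISE)]  # unrelated SNPs
    rows.append(signal + noise)

auc_single = auc([r[0] for r in rows], labels)  # one weakly associated SNP
auc_all = auc([sum(r) for r in rows], labels)   # unweighted sum over all 100 SNPs
print(f"single SNP AUC: {auc_single:.3f}, all SNPs AUC: {auc_all:.3f}")
```

In this toy setting the summed score separates cases from controls better than any single SNP, even though half of the summed SNPs carry no signal. A penalized method such as the lasso would additionally learn weights and drop some SNPs, which this sketch does not attempt.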

Figures

**Figure 1**
Log-likelihood for the WTCCC Crohn’s disease data using three different ways to carry out the pre-selection of significant SNPs in relation to the cross-validation. The training data log-likelihood was rescaled by a factor of 1878/2808 to be on the same scale as the test data log-likelihood.
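
Why the placement of SNP pre-selection relative to cross-validation matters can be sketched on pure noise. The caption does not spell out the three schemes, so the example below only contrasts two extremes under stated assumptions: selecting the top features on all samples before splitting ("leaky") versus selecting them on the training half only ("honest"). The data are simulated Gaussian noise, a single train/test split stands in for cross-validation, and all names and sizes are hypothetical.

```python
import random

def auc(scores, labels):
    """Chance that a random case outscores a random control (ties count 1/2)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for c in cases:
        for d in controls:
            wins += 1.0 if c > d else 0.5 if c == d else 0.0
    return wins / (len(cases) * len(controls))

def top_k_by_association(rows, labels, k):
    """Return (feature index, sign) for the k largest case/control mean differences."""
    n1 = sum(labels)
    n0 = len(labels) - n1
    ranked = []
    for j in range(len(rows[0])):
        m1 = sum(r[j] for r, y in zip(rows, labels) if y == 1) / n1
        m0 = sum(r[j] for r, y in zip(rows, labels) if y == 0) / n0
        ranked.append((abs(m1 - m0), j, 1 if m1 >= m0 else -1))
    ranked.sort(reverse=True)
    return [(j, s) for _, j, s in ranked[:k]]

def score(row, selected):
    """Signed sum of the selected features."""
    return sum(s * row[j] for j, s in selected)

rng = random.Random(1)
N, P, K = 100, 2000, 10                      # hypothetical sizes
labels = [i % 2 for i in range(N)]
rows = [[rng.gauss(0.0, 1.0) for _ in range(P)] for _ in range(N)]  # no true signal

train = list(range(N // 2))
test = list(range(N // 2, N))
test_labels = [labels[i] for i in test]

# Leaky: features chosen using every sample, including the test half.
leaky = top_k_by_association(rows, labels, K)
# Honest: features chosen using the training half only.
honest = top_k_by_association([rows[i] for i in train], [labels[i] for i in train], K)

auc_leaky = auc([score(rows[i], leaky) for i in test], test_labels)
auc_honest = auc([score(rows[i], honest) for i in test], test_labels)
print(f"test AUC, selection on all data: {auc_leaky:.3f}")
print(f"test AUC, selection on training data only: {auc_honest:.3f}")
```

Because the features are noise, any test AUC well above 0.5 is an artifact of the selection step having seen the test labels. Repeating the selection inside each training fold removes that optimism.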

**Figure 2**
Log-likelihood and AUC for the WTCCC Crohn’s disease data for prediction models for test and training data. The training data log-likelihood was rescaled by a factor of 1878/2808 to be on the same scale as the test data log-likelihood. Note that not all SNPs considered have nonzero coefficients; see Table 1. The log-likelihood for stepwise GLM using AIC (GLM-AIC) is −1287.7. The inset at the bottom left vertically expands the curves for the models with 100 SNPs or fewer.
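
The rescaling mentioned in the captions can be written out as a short calculation. The sketch below assumes that 2808 and 1878 are the numbers of training and test subjects; the caption gives only the factor, so that reading is an assumption. The function names are hypothetical.

```python
import math

def bernoulli_loglik(probs, labels, eps=1e-12):
    """Sum of log-likelihood contributions for predicted case probabilities."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)      # guard against log(0)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

def rescale_to_test(loglik_train, n_train=2808, n_test=1878):
    """Put a training log-likelihood on the scale of a test set of size n_test.

    The default sizes are an assumed reading of the factor 1878/2808.
    """
    return loglik_train * n_test / n_train

# A log-likelihood is a sum over subjects, so it grows with sample size.
# Multiplying by n_test / n_train makes the two curves comparable.
example = rescale_to_test(-2808.0)
print(example)
```

Since the log-likelihood is a sum over subjects, an equivalent choice would be to report the mean log-likelihood per subject for both sets.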

**Figure 3**
SNPs that do and do not receive nonzero coefficients in the lasso model and other prediction models for the WTCCC Crohn’s disease data. The SNPs are ordered on the horizontal axis by significance. The vertical stripes suggest that the same SNPs are frequently selected.

**Figure 4**
ROC curves for the WTCCC Crohn’s disease data for prediction for lasso models considering different numbers of SNPs. The AUCs for these models are displayed in Table 1.
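
For readers unfamiliar with how an ROC curve and its AUC are obtained from predicted probabilities, here is a minimal sketch. The scores and labels are made-up toy values, not data from the study, and the function names are hypothetical.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, threshold swept high to low."""
    pairs = sorted(zip(scores, labels), key=lambda t: -t[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Subjects with tied scores cross the threshold together.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def trapezoid_auc(points):
    """Area under a piecewise-linear ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy example: 1 = case, 0 = control.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
print(points)
print(trapezoid_auc(points))
```

In this toy example 8 of the 9 case/control pairs are ranked correctly, so the area equals 8/9.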

**Figure 5**
Smoothed estimates of the probability of being a case as a function of the predicted probability of being a case, with 95% confidence intervals, for the WTCCC Crohn’s disease data. The steeper curves for the training data suggest some overfitting, while the test data appear better calibrated.

**Figure 6**
Comparison of test data AUC for the NIDDK and WTCCC data. The model for the NIDDK data is trained on the complete WTCCC data; the model for the test part of the WTCCC data is trained on the training part of the WTCCC data.
