Review
Nat Rev Genet. 2013 Jul;14(7):507-15.
doi: 10.1038/nrg3457.

Pitfalls of predicting complex traits from SNPs

Naomi R Wray et al. Nat Rev Genet. 2013 Jul.

Abstract

The success of genome-wide association studies (GWASs) has led to increasing interest in making predictions of complex trait phenotypes, including disease, from genotype data. Rigorous assessment of the value of predictors is crucial before implementation. Here we discuss some of the limitations and pitfalls of prediction analysis and show how naive implementations can lead to severe bias and misinterpretation of results.

Figures

Figure 1. Flowchart of SNP-based prediction analysis
There are three stages in the development of a risk predictor: discovery, validation and application. At each stage, data are needed as input, a process is applied to the data and a result is generated. a. At this stage, effect sizes estimated from the combined discovery and validation samples can be used.
Figure 2. Examples of the overlap pitfall: non-independence of discovery and validation samples
a) Human: A high R² can be achieved by chance, particularly when the sample size is small. We simulated GWAS data based on real human genotype data under the null hypothesis of no association, using data from 11,586 unrelated European Americans genotyped on 563,212 SNPs. We randomly sampled N individuals and selected the top SNPs for height at p < 10⁻⁵ (red bar) and p < 10⁻⁴ (blue bar) to predict the phenotype in the same data. We also performed an association analysis for the real height phenotype in 10,000 individuals and selected the top SNPs at p < 10⁻⁵ (green bar) and p < 10⁻⁴ (purple bar) to predict the height phenotype in the same sample. The graph shows the mean prediction R² over 100 simulation replicates. Error bars show the standard error of the mean. The number on top of each column is the mean number of selected SNPs over the 100 simulation replicates.
b) Drosophila: An example illustrating the bias that arises when selecting the top SNPs. We downloaded genotype data from the Drosophila Genetics Reference Panel and simulated phenotypes under the null hypothesis, i.e., only random association between each of the > 1 million SNPs and the phenotype. We repeated a previously reported GWAS analysis, selecting the top 10 independently associated SNPs, and predicted the phenotypes of the lines using these 10 SNPs. Because the simulated data contain only random associations between SNPs and phenotype, any apparent predictive power is spurious and is the result of over-fitting. By chance, the top SNPs (in terms of test statistic) explain 57% (R² = 57%) of the phenotypic variance between the inbred lines, from a linear regression of phenotype on predictor. Both phenotype and predictor have been standardized to normal distribution z-scores (mean of zero and standard deviation of one).
c) Dairy Cattle: The impact of leaving the validation cohort in the discovery set, either at both the SNP selection (GWAS) and SNP effect estimation stages, or at the effect size estimation stage only. Data shown are from 2,732 bulls with ~500K SNPs, phenotyped for the average milk yield of their daughters. The bulls were split into a discovery sample (bulls born during or before 2003), Nd = 2,458, and a validation sample (bulls born after 2003), Nv = 274.
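
The overlap pitfall described in this caption can be reproduced in miniature. The sketch below is only an illustration under stated assumptions, not the authors' pipeline: genotypes are simulated as independent binomial draws rather than taken from real data, the phenotype is pure noise (the null hypothesis), SNPs are ranked by squared marginal correlation rather than by a p-value threshold, and the function names (marginal_effects, prediction_r2, simulate) and parameter values are hypothetical.

```python
import numpy as np


def marginal_effects(G, y):
    """Per-SNP least-squares slope and squared correlation with y."""
    Gc = G - G.mean(axis=0)
    yc = y - y.mean()
    ss_g = (Gc ** 2).sum(axis=0)
    cross = Gc.T @ yc
    beta = cross / ss_g
    r2 = cross ** 2 / (ss_g * (yc ** 2).sum())
    return beta, r2


def prediction_r2(G, y, beta, idx):
    """Squared correlation between a SNP score and the phenotype."""
    score = G[:, idx] @ beta[idx]
    return np.corrcoef(score, y)[0, 1] ** 2


def simulate(n=200, m=5000, n_top=10, seed=0):
    """Null simulation: return (in-sample R2, held-out R2).

    Assumed setup: 2*n individuals, m independent SNPs with allele
    frequency 0.3, and a phenotype unrelated to any SNP.
    """
    rng = np.random.default_rng(seed)
    G = rng.binomial(2, 0.3, size=(2 * n, m)).astype(float)
    y = rng.standard_normal(2 * n)  # no true association with any SNP
    disc, val = slice(0, n), slice(n, 2 * n)

    # Discovery: estimate effects and pick the top SNPs in the first half.
    beta, r2 = marginal_effects(G[disc], y[disc])
    top = np.argsort(r2)[-n_top:]

    # Pitfall: evaluate in the same individuals used for discovery.
    in_sample = prediction_r2(G[disc], y[disc], beta, top)
    # Correct: evaluate in individuals never used for discovery.
    held_out = prediction_r2(G[val], y[val], beta, top)
    return in_sample, held_out


if __name__ == "__main__":
    in_r2, out_r2 = simulate()
    print(f"R2 in discovery sample: {in_r2:.3f}")
    print(f"R2 in independent validation sample: {out_r2:.3f}")
```

Because the phenotype here has no genetic component at all, any R² obtained in the discovery sample reflects selection and over-fitting alone, whereas the R² in the independent half should stay close to zero.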
