Review
Plant Methods. 2013 Jul 22:9:29.
doi: 10.1186/1746-4811-9-29. eCollection 2013.

The advantages and limitations of trait analysis with GWAS: a review

Arthur Korte et al. Plant Methods.

Abstract

Over the last 10 years, high-density SNP arrays and DNA re-sequencing have illuminated the majority of the genotypic space for a number of organisms, including humans, maize, rice and Arabidopsis. For any researcher willing to define and score a phenotype across many individuals, genome-wide association studies (GWAS) present a powerful tool to connect this trait back to its underlying genetics. In this review we discuss the biological and statistical considerations that determine whether an analysis succeeds. We present the relevance of biological factors, including effect size, sample size, genetic heterogeneity, genomic confounding, linkage disequilibrium and spurious association, together with the statistical tools that account for them. GWAS can offer a valuable first insight into trait architecture or candidate loci for subsequent validation.

Keywords: Arabidopsis; Effect size; GWAS; Genetic heterogeneity; Mixed model.

Figures

Figure 1
Sample size and effect size. a) Power and FDR for an idealized phenotype. Simulations in which a single random SNP explains 5%, 10% or 20% of the phenotypic variance (with heritability ~0.75) were performed in 200, 400 or 800 individuals [67]. Simulations are based on the available SNP data for Arabidopsis [20], with structure added by giving 10,000 random SNPs a tiny effect size. The star indicates power (the ability to find true positives) and FDR (the false discovery rate, a measure of false positives) at the 5% Bonferroni-corrected threshold for 220,000 markers. b) An example of one particular simulation in which the causative SNP (red diamond) is not the most significant SNP in the local window. Remaining SNPs are colored according to their linkage to the causative SNP. The dashed line denotes the 5% Bonferroni-corrected threshold for 220,000 markers.
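The Bonferroni-corrected threshold quoted in this caption can be worked out directly. The sketch below assumes the usual definition (the family-wise error rate divided by the number of tests); the function and variable names are illustrative and do not come from the paper.

```python
import math

def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test p-value cutoff that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

n_markers = 220_000   # number of markers quoted in the caption
alpha = 0.05          # the "5%" family-wise error rate

threshold = bonferroni_threshold(alpha, n_markers)
neg_log10 = -math.log10(threshold)   # height of the dashed line on a -log10(p) axis

print(f"p-value cutoff: {threshold:.3g}")   # 2.27e-07
print(f"-log10 cutoff:  {neg_log10:.2f}")   # 6.64
```

On a Manhattan plot drawn on a -log10(p) scale, the dashed line would therefore sit at roughly 6.64 under these assumptions.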
Figure 2
Synthetic association due to genetic heterogeneity. a) A theoretical phylogenetic tree of three individuals upon which three mutations occur. The two most recent mutations (stars) cause a change in phenotype (red fruit). b) The older blue mutation has no effect on fruit colour, but is in perfect correlation with the trait. Neither causative mutation is a very good predictor of the phenotype.
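A toy calculation makes the point of this caption concrete. The genotype encoding below is an assumption, one plausible reading of the tree (the older blue mutation is shared by the two red-fruited individuals, and each of them carries its own causative mutation); it is not data from the paper.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical encoding for individuals A, B, C (1 = carries mutation / red fruit).
red_fruit     = [0, 1, 1]   # phenotype
blue_mutation = [0, 1, 1]   # older, non-causative mutation shared by B and C
causal_1      = [0, 1, 0]   # recent causative mutation in B only
causal_2      = [0, 0, 1]   # recent causative mutation in C only

r2_blue = pearson_r(blue_mutation, red_fruit) ** 2
r2_causal_1 = pearson_r(causal_1, red_fruit) ** 2
r2_causal_2 = pearson_r(causal_2, red_fruit) ** 2

print(f"blue (non-causative): r^2 = {r2_blue:.2f}")      # 1.00
print(f"causal mutation 1:    r^2 = {r2_causal_1:.2f}")  # 0.25
print(f"causal mutation 2:    r^2 = {r2_causal_2:.2f}")  # 0.25
```

Under this encoding the non-causative marker tags the phenotype perfectly, while each causative mutation on its own explains only part of it. A single-marker scan would therefore rank the non-causative marker first.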
Figure 3
Taking genetic background into account improves the performance of GWAS. Manhattan plots for a simulated trait, in which each data point represents a genotyped SNP, ordered across the five chromosomes of Arabidopsis. Five SNPs (indicated by vertical dashed lines) were randomly chosen to be ‘causative’ and account for up to 10% of the phenotypic variance each. GWAS using a) a linear model, and b) a mixed model that accounts for population structure and other background genomic factors. The simple linear model leads to heavily inflated p-values and the five causative markers are not the strongest associations. The mixed model is superior, but still leads to one false negative and one false positive. A dashed horizontal line denotes the 5% Bonferroni threshold.
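Why a marginal linear model can mislead under population structure can be shown with a small, fully made-up example. The sketch below does not implement the mixed model of the caption. As a deliberately simpler stand-in, it adjusts for a known subpopulation label as a fixed effect by centering genotype and phenotype within each group. All data values and function names are hypothetical.

```python
def slope(x, y):
    """Least-squares slope of y on x (simple linear regression)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def center_within(values, groups):
    """Subtract each group's mean from its members (fixed-effect adjustment)."""
    totals, counts = {}, {}
    for v, g in zip(values, groups):
        totals[g] = totals.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return [v - totals[g] / counts[g] for v, g in zip(values, groups)]

# Two subpopulations that differ in both allele frequency and trait mean.
# The SNP has no effect on the trait within either subpopulation.
population = [0, 0, 0, 0, 1, 1, 1, 1]
genotype   = [0, 0, 0, 1, 1, 1, 1, 0]
phenotype  = [1, 2, 3, 2, 11, 12, 13, 12]

marginal = slope(genotype, phenotype)
adjusted = slope(center_within(genotype, population),
                 center_within(phenotype, population))

print(f"marginal slope: {marginal:.2f}")        # 5.00, a spurious association
print(f"adjusted slope: {abs(adjusted):.2f}")   # 0.00 once structure is removed
```

A mixed model generalizes this idea by modelling relatedness among all individuals through a kinship-based random effect, so it does not rely on discrete group labels being known in advance.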
Figure 4
The mixed model dramatically reduces inflation of p-values. Quantile-Quantile plot showing strong p-value inflation for a marginal GWAS that does not consider population structure (red line). Accounting for population structure with the mixed model dramatically reduces inflation (blue line). The grey line indicates the expected p-value distribution under the null hypothesis of no causative markers in the data. Note that after correction for population structure, only the most significant markers deviate from the null expectation.
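The coordinates of a Quantile-Quantile plot like this one can be computed in a few lines. The sketch below assumes the common convention of plotting sorted observed -log10(p) against the -log10 of uniform quantiles (i - 0.5)/n. It also computes the genomic inflation factor, a widely used one-number summary that this caption does not itself mention. The function names and toy p-values are illustrative.

```python
import math
from statistics import NormalDist, median

def qq_points(pvalues):
    """Return (expected, observed) -log10(p) pairs, both in ascending order."""
    n = len(pvalues)
    observed = sorted(-math.log10(p) for p in pvalues)
    # Under the null, p-values are uniform on (0, 1).
    expected = sorted(-math.log10((i - 0.5) / n) for i in range(1, n + 1))
    return list(zip(expected, observed))

# Median of a chi-square distribution with 1 degree of freedom (about 0.455).
CHI2_1_MEDIAN = NormalDist().inv_cdf(0.75) ** 2

def inflation_factor(pvalues):
    """Median observed chi-square statistic divided by its null median."""
    z = NormalDist().inv_cdf(1 - median(pvalues) / 2)
    return z * z / CHI2_1_MEDIAN

n = 1001
null_p = [(i - 0.5) / n for i in range(1, n + 1)]   # perfectly uniform p-values
inflated_p = [p ** 2 for p in null_p]               # artificially too-small p-values

lambda_null = inflation_factor(null_p)
lambda_inflated = inflation_factor(inflated_p)
print(f"null:     lambda = {lambda_null:.2f}")
print(f"inflated: lambda = {lambda_inflated:.2f}")
```

Points from `qq_points` that lie on the diagonal match the null expectation (the grey line). A lambda near 1 indicates little inflation, and values well above 1 indicate the kind of inflation shown by the red line.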

References

    1. Alonso-Blanco C, El-Assal SE, Coupland G, Koornneef M. Analysis of natural allelic variation at flowering time loci in the Landsberg erecta and Cape Verde Islands ecotypes of Arabidopsis thaliana. Genetics. 1998;149(2):749–764.
    2. Clarke JH, Mithen R, Brown JK, Dean C. QTL analysis of flowering time in Arabidopsis thaliana. Mol Gen Genet. 1995;248(3):278–286. doi: 10.1007/BF02191594.
    3. Kowalski SP, Lan TH, Feldmann KA, Paterson AH. QTL mapping of naturally-occurring variation in flowering time of Arabidopsis thaliana. Mol Gen Genet. 1994;245(5):548–555.
    4. Koornneef M, Alonso-Blanco C, Vreugdenhil D. Naturally occurring genetic variation in Arabidopsis thaliana. Annu Rev Plant Biol. 2004;55:141–172. doi: 10.1146/annurev.arplant.55.031903.141605.
    5. Borevitz JO, Nordborg M. The impact of genomics on the study of natural variation in Arabidopsis. Plant Physiol. 2003;132(2):718–725. doi: 10.1104/pp.103.023549.