Review
Bioinformatics. 2010 Feb 15;26(4):445-55.
doi: 10.1093/bioinformatics/btp713. Epub 2010 Jan 6.

Bioinformatics challenges for genome-wide association studies

Jason H Moore et al. Bioinformatics. 2010.

Abstract

The sequencing of the human genome has made it possible to identify an informative set of >1 million single nucleotide polymorphisms (SNPs) across the genome that can be used to carry out genome-wide association studies (GWASs). The availability of massive amounts of GWAS data has necessitated the development of new biostatistical methods for quality control, imputation and analysis issues including multiple testing. This work has been successful and has enabled the discovery of new associations that have been replicated in multiple studies. However, it is now recognized that most SNPs discovered via GWAS have small effects on disease susceptibility and thus may not be suitable for improving health care through genetic testing. One likely explanation for the mixed results of GWAS is that the current biostatistical analysis paradigm is by design agnostic or unbiased in that it ignores all prior knowledge about disease pathobiology. Further, the linear modeling framework that is employed in GWAS often considers only one SNP at a time thus ignoring their genomic and environmental context. There is now a shift away from the biostatistical approach toward a more holistic approach that recognizes the complexity of the genotype-phenotype relationship that is characterized by significant heterogeneity and gene-gene and gene-environment interaction. We argue here that bioinformatics has an important role to play in addressing the complexity of the underlying genetic basis of common human diseases. The goal of this review is to identify and discuss those GWAS challenges that will require computational methods.

Figures

Fig. 1.
Overview of the RF algorithm summarized in Section 2.3. Adapted from Reif et al. (2006).
Fig. 2.
Summary of the constructive induction process for MDR. The left bars within each cell represent the number of cases while the right bars represent the number of controls. Dark-shaded cells are high risk while the light-shaded cells are low risk. Prediction using any classifier can be carried out using the final constructed attribute.
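The constructive induction step in this caption can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `mdr_construct`, the 0/1/2 genotype coding, and the rule that a cell is high risk when its case:control ratio exceeds a `threshold` of 1.0 are all assumptions made for the example.

```python
from collections import Counter


def mdr_construct(genotypes_a, genotypes_b, status, threshold=1.0):
    """Collapse two SNPs into one binary high-risk/low-risk attribute.

    genotypes_a, genotypes_b: genotype codes per individual (assumed 0, 1, 2).
    status: 1 for a case, 0 for a control.
    threshold: case:control ratio above which a cell is labeled high risk
               (an assumption here; 1.0 suits a balanced dataset).
    """
    cases = Counter()
    controls = Counter()
    for a, b, s in zip(genotypes_a, genotypes_b, status):
        if s == 1:
            cases[(a, b)] += 1
        else:
            controls[(a, b)] += 1

    high_risk = set()
    for cell in set(cases) | set(controls):
        # Equivalent to cases/controls > threshold, without dividing by zero.
        if cases[cell] > threshold * controls[cell]:
            high_risk.add(cell)

    # The constructed attribute is 1 (high risk) or 0 (low risk) per individual.
    constructed = [
        1 if (a, b) in high_risk else 0
        for a, b in zip(genotypes_a, genotypes_b)
    ]
    return constructed, high_risk


# Toy data with an XOR-like pattern: neither SNP predicts status on its own.
snp_a = [0, 0, 1, 1, 0, 0, 1, 1]
snp_b = [0, 1, 0, 1, 0, 1, 0, 1]
status = [0, 1, 1, 0, 0, 1, 1, 0]

constructed, high_risk = mdr_construct(snp_a, snp_b, status)
print(sorted(high_risk))  # [(0, 1), (1, 0)]
print(constructed)        # [0, 1, 1, 0, 0, 1, 1, 0]
```

In this toy dataset the single constructed attribute reproduces case status exactly, which is why any classifier can then be applied to it.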
Fig. 3.
Summary of how Relief, ReliefF and SURF select neighbors. Each panel in this figure shows the genotypes at two markers for a dataset of cases and controls. For the purpose of this example, only these two markers are considered and both are continuous. When analyzing real data, the process of selecting neighbors is the same, but there will be thousands of discrete-valued markers (SNPs), each represented by one of thousands of dimensions. The individual for whom neighbors are being found is shown by the filled red circle. The neighbors that each approach uses for weighting are highlighted in blue. (A-C) Represent how Relief, ReliefF and SURF would select neighbors to be used in weighting. Relief selects the nearest individual of the same dichotomous class (blue circle) and the nearest individual of the other class (blue cross). ReliefF selects some user-specified number of individuals (two in this example) to be used for weighting. SURF, instead of using a fixed number of neighbors, uses all individuals within a distance threshold. The dotted line shows a hypothetical distance threshold.
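The three neighbor-selection rules in this caption can be compared side by side in a short Python sketch. It covers only neighbor selection, not the subsequent attribute weighting. Several details are assumptions made for illustration: the function names, the use of Euclidean distance on two continuous markers, taking `k` neighbors from each class in ReliefF, and passing the SURF threshold in as a `radius` argument rather than deriving it from the data. `math.dist` requires Python 3.8 or later.

```python
import math


def _ranked(points, i):
    """Other individuals sorted by Euclidean distance from individual i."""
    return sorted(
        (math.dist(points[i], p), j)
        for j, p in enumerate(points)
        if j != i
    )


def relief_neighbors(points, labels, i):
    """Relief: the single nearest hit (same class) and nearest miss."""
    ranked = _ranked(points, i)
    hit = next(j for _, j in ranked if labels[j] == labels[i])
    miss = next(j for _, j in ranked if labels[j] != labels[i])
    return [hit], [miss]


def relieff_neighbors(points, labels, i, k=2):
    """ReliefF: the k nearest hits and k nearest misses (k per class assumed)."""
    ranked = _ranked(points, i)
    hits = [j for _, j in ranked if labels[j] == labels[i]][:k]
    misses = [j for _, j in ranked if labels[j] != labels[i]][:k]
    return hits, misses


def surf_neighbors(points, labels, i, radius):
    """SURF: every individual within the distance threshold, however many."""
    ranked = _ranked(points, i)
    hits = [j for d, j in ranked if d <= radius and labels[j] == labels[i]]
    misses = [j for d, j in ranked if d <= radius and labels[j] != labels[i]]
    return hits, misses


# Six individuals measured at two continuous markers; 1 = case, 0 = control.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 0.0), (0.0, 4.0), (5.0, 5.0)]
labels = [1, 1, 0, 1, 0, 0]

print(relief_neighbors(points, labels, 0))        # ([1], [2])
print(relieff_neighbors(points, labels, 0, k=2))  # ([1, 3], [2, 4])
print(surf_neighbors(points, labels, 0, 3.0))     # ([1, 3], [2])
```

With a radius of 3.0, SURF returns two hits but only one miss for individual 0, which shows how the number of neighbors it uses depends on local density rather than on a fixed count.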
Fig. 4.
Flowchart for a simple GP. The goal is to randomly generate an initial population of computer programs or solutions (e.g. genetic models), determine their fitness, select the best models, introduce variability and then iterate until the termination criteria are satisfied. This executes a parallel stochastic search using the principles of evolution by natural selection.
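The loop in this flowchart can be sketched as a very small genetic programming run in Python. Everything specific here is an assumption chosen to keep the example short: the programs are Boolean expression trees over `and`, `or` and `xor`, fitness is classification accuracy, selection keeps the top half of the population, the best program is copied unchanged into the next generation, and the run stops at a perfect score or after a fixed number of generations.

```python
import random

# Function set for the evolved programs (an illustrative choice).
OPS = {
    "and": lambda a, b: a and b,
    "or": lambda a, b: a or b,
    "xor": lambda a, b: a != b,
}


def random_tree(rng, n_vars, depth):
    """A leaf is a variable index; an internal node is (op, left, right)."""
    if depth == 0 or rng.random() < 0.3:
        return rng.randrange(n_vars)
    op = rng.choice(sorted(OPS))
    return (op, random_tree(rng, n_vars, depth - 1), random_tree(rng, n_vars, depth - 1))


def evaluate(tree, row):
    if isinstance(tree, int):
        return bool(row[tree])
    op, left, right = tree
    return OPS[op](evaluate(left, row), evaluate(right, row))


def fitness(tree, rows, labels):
    """Fraction of individuals whose status the program predicts correctly."""
    hits = sum(evaluate(tree, r) == bool(y) for r, y in zip(rows, labels))
    return hits / len(rows)


def replace_random(rng, tree, make_subtree):
    """Return a copy of tree with one randomly chosen subtree replaced."""
    if isinstance(tree, int) or rng.random() < 0.3:
        return make_subtree()
    op, left, right = tree
    if rng.random() < 0.5:
        return (op, replace_random(rng, left, make_subtree), right)
    return (op, left, replace_random(rng, right, make_subtree))


def random_subtree(rng, tree):
    """Walk down from the root and stop at a random depth."""
    while not isinstance(tree, int) and rng.random() < 0.7:
        tree = rng.choice(tree[1:])
    return tree


def evolve(rows, labels, pop_size=50, generations=20, seed=0):
    rng = random.Random(seed)
    n_vars = len(rows[0])
    score = lambda t: fitness(t, rows, labels)
    # 1. Randomly generate the initial population of programs.
    population = [random_tree(rng, n_vars, 3) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        # 2. Determine fitness and rank the population.
        ranked = sorted(population, key=score, reverse=True)
        history.append(score(ranked[0]))
        if history[-1] == 1.0:  # termination criterion: perfect accuracy
            break
        # 3. Select the best models (top half), keeping the best one as-is.
        parents = ranked[: pop_size // 2]
        children = [ranked[0]]
        # 4. Introduce variability through crossover and mutation.
        while len(children) < pop_size:
            donor = rng.choice(parents)
            child = replace_random(rng, rng.choice(parents), lambda: random_subtree(rng, donor))
            if rng.random() < 0.2:
                child = replace_random(rng, child, lambda: random_tree(rng, n_vars, 2))
            children.append(child)
        population = children
    return max(population, key=score), history


# Toy data: status is the XOR of markers 0 and 1; marker 2 is noise.
rows = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
labels = [a ^ b for a, b, _ in rows]

best, history = evolve(rows, labels)
print(best, fitness(best, rows, labels))
```

Because the best program is always carried forward, the best fitness recorded in `history` never decreases from one generation to the next. The search is stochastic, so different seeds can return different programs.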
Fig. 5.
Flowchart for bioinformatics analyses of GWAS data. The use of filter and wrapper algorithms along with computational modeling approaches is recommended in addition to parametric statistical methods. Biological knowledge in public databases has a very important role to play at all levels of the analysis and interpretation.
