BMC Bioinformatics. 2010 Sep 1:11:442.
doi: 10.1186/1471-2105-11-442.

Forward-time simulation of realistic samples for genome-wide association studies

Bo Peng et al. BMC Bioinformatics. 2010.

Abstract

Background: Forward-time simulations have unique advantages in power and flexibility for the simulation of genetic samples of complex human diseases because they can closely mimic the evolution of human populations carrying these diseases. However, a number of methodological and computational constraints have prevented the power of this simulation method from being fully explored in existing forward-time simulation methods.

Results: Using a general-purpose forward-time population genetics simulation environment, we developed a forward-time simulation method that can be used to simulate realistic samples for genome-wide association studies. We examined the properties of this simulation method by comparing simulated samples with real data and demonstrated its wide applicability using four examples, including a simulation of case-control samples with a disease caused by multiple interacting genetic and environmental factors, a simulation of trio families affected by a disease-predisposing allele that had been subjected to either a slow or a rapid selective sweep, and a simulation of a structured population resulting from recent population admixture.
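To make the idea of forward-time simulation concrete, the following is a minimal sketch of a single-locus Wright-Fisher model that steps a population forward one generation at a time, applying selection and then random drift. It is a toy illustration and not the authors' method or simulation environment; the function names, the genic selection model, and all parameter values are assumptions made for this example.

```python
import random


def wright_fisher_step(freq, pop_size, advantage, rng):
    """Advance the favored-allele frequency by one generation.

    Assumed model (not from the paper): genic selection with relative
    fitness 1 + advantage for the favored allele, followed by random
    sampling of 2 * pop_size gametes (diploid population).
    """
    # Selection: shift the expected frequency toward the favored allele.
    weighted = freq * (1.0 + advantage)
    expected = weighted / (weighted + (1.0 - freq))
    # Drift: each of the 2N gametes carries the allele with prob. `expected`.
    n_gametes = 2 * pop_size
    copies = sum(1 for _ in range(n_gametes) if rng.random() < expected)
    return copies / n_gametes


def simulate_trajectory(init_freq, pop_size, advantage, generations, seed=0):
    """Return the allele frequency at generation 0, 1, ..., generations."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    freqs = [init_freq]
    for _ in range(generations):
        freqs.append(wright_fisher_step(freqs[-1], pop_size, advantage, rng))
    return freqs


if __name__ == "__main__":
    # Hypothetical parameters: a weak and a strong selective advantage.
    slow = simulate_trajectory(0.1, pop_size=500, advantage=0.02, generations=50)
    rapid = simulate_trajectory(0.1, pop_size=500, advantage=0.5, generations=50)
    print("final frequency, weak selection:  ", slow[-1])
    print("final frequency, strong selection:", rapid[-1])
```

Because the simulation moves forward in time, selection, demography, or other forces can be changed at any generation, which is the flexibility the abstract refers to. A realistic simulation would track many linked markers with recombination rather than one locus.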

Conclusions: Our algorithm simulates populations that closely resemble the complex structure of the human genome, while allowing the introduction of signals of natural selection. Because of its flexibility to generate different types of samples with arbitrary disease or quantitative trait models, this simulation method can simulate realistic samples to evaluate the performance of a wide variety of statistical gene mapping methods for genome-wide association studies.


Figures

Figure 1
Allele frequencies of the initial (x-axis) and expanded (y-axis) populations of four simulations with population sizes 50000, 25000, 10000 and 50000, and scaling factors λ = 1 (unscaled), 2, 5 and 5 respectively.
Figure 2
Average LD values as a function of marker distance for the initial population and four expanded populations of sizes 50000, 25000, 10000 and 50000, using scaling factors λ = 1 (unscaled), 2, 5, and 5 respectively. The y-axis is plotted in log scale to distinguish LD curves in low LD regions. Marker distances were cut into bins of 10 kbp. For example, the average LD at point 200 kbp represents the mean pairwise LD values of all pairs of markers that were 200 kbp to 210 kbp apart.
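The binning described in this caption can be sketched as follows. This is an illustrative example only: the caption does not say which LD measure was used, so r² is an assumption here, as are the function names and the input layout (one list of 0/1 alleles per marker, across haplotypes). Each bin is labeled by its lower edge, so the value at 200 kbp averages pairs that are 200 kbp to 210 kbp apart.

```python
from collections import defaultdict


def r_squared(alleles_a, alleles_b):
    """r^2 between two biallelic markers coded 0/1 on the same haplotypes.

    Returns None if either marker is monomorphic (r^2 is undefined).
    """
    n = len(alleles_a)
    p_a = sum(alleles_a) / n
    p_b = sum(alleles_b) / n
    p_ab = sum(1 for a, b in zip(alleles_a, alleles_b) if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return None
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / denom


def mean_ld_by_distance(positions, haplotypes, bin_width=10_000):
    """Average pairwise r^2 in distance bins of `bin_width` base pairs.

    positions[m]  : base-pair position of marker m
    haplotypes[m] : list of 0/1 alleles at marker m, one per haplotype
    Returns {bin lower edge in bp: mean r^2 of marker pairs in that bin}.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            r2 = r_squared(haplotypes[i], haplotypes[j])
            if r2 is None:
                continue  # skip pairs involving a monomorphic marker
            distance = abs(positions[j] - positions[i])
            bin_start = (distance // bin_width) * bin_width
            sums[bin_start] += r2
            counts[bin_start] += 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}


if __name__ == "__main__":
    # Tiny made-up data set: three markers typed on four haplotypes.
    positions = [0, 5_000, 205_000]
    haplotypes = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 0, 1]]
    print(mean_ld_by_distance(positions, haplotypes))
```

The double loop over marker pairs is quadratic in the number of markers, which is acceptable for a sketch but would need a distance cutoff for genome-scale data.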
Figure 3
Negative of the base 10 logarithm of p-values of allele-based χ2 tests between 1000 cases and 1000 controls at 6000 markers (2000 each) on chromosomes 2, 5, and 10. Markers rs4491689 and rs6869003 are causal. Marker rs7720081 has a low p-value because it is closely linked to marker rs6869003.
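As an illustration of the quantity plotted here, the sketch below runs an allele-based χ2 test on a 2 × 2 table of allele counts (cases vs. controls) and converts the p-value to its negative base 10 logarithm. It is a generic textbook version of the test without continuity correction, not code from the paper; the function names and the example counts are made up.

```python
import math


def allelic_chi2_test(case_counts, control_counts):
    """Pearson chi-square test on a 2 x 2 table of allele counts.

    case_counts, control_counts: (copies of allele 1, copies of allele 2).
    Returns (chi2 statistic, p-value) for 1 degree of freedom.
    """
    table = [list(case_counts), list(control_counts)]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][k] + table[1][k] for k in range(2)]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for k in range(2):
            expected = row_totals[i] * col_totals[k] / total
            if expected == 0:
                return 0.0, 1.0  # monomorphic marker: no evidence either way
            chi2 += (table[i][k] - expected) ** 2 / expected
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


def neg_log10(p_value):
    """-log10(p), returning infinity if p underflowed to zero."""
    return math.inf if p_value == 0 else -math.log10(p_value)


if __name__ == "__main__":
    # 1000 diploid cases and 1000 diploid controls give 2000 alleles each.
    chi2, p = allelic_chi2_test((1200, 800), (1000, 1000))
    print(f"chi2 = {chi2:.2f}, -log10(p) = {neg_log10(p):.2f}")
```

Counting alleles rather than genotypes doubles the sample size in the table, which is why 1000 cases contribute 2000 observations per marker.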
Figure 4
An initial population of 170 independent individuals of the JPT+CHB population of Phase 3 of the HapMap data set was expanded to large populations and subjected to slow (B, D, F) and rapid (C, E, G) selective sweeps at locus rs2173746. The trajectories of the frequencies of allele T at this marker in simulations after slow (B) and rapid (C) sweeps are plotted. The LD maps of 500 markers on chromosome 2 of the initial population (A), and 100 markers around locus rs2173746 of expanded populations after the slow (D) and rapid (E) sweeps are plotted. 1000 cases and 1000 controls were drawn from these expanded populations. The negative base 10 logarithms of the p-values at 500 markers are plotted for slow (F) and rapid (G) sweeps.
Figure 5
Ancestry values and p-values of association tests. The top figures plot recorded and estimated MKK ancestry values of 500 cases (a) and 500 controls (b). Individuals are sorted by their true MKK ancestry values. The bottom figures plot the negative of the base 10 logarithm of p-values of allele-based χ2 tests (c) and structured association tests (d) between 500 cases and 500 controls at 2000 markers.
