Bioinformatics. 2023 Sep 2;39(9):btad535.
doi: 10.1093/bioinformatics/btad535.

HAPNEST: efficient, large-scale generation and evaluation of synthetic datasets for genotypes and phenotypes

Sophie Wharrie et al. Bioinformatics. 2023.

Abstract

Motivation: Existing methods for simulating synthetic genotype and phenotype datasets have limited scalability, constraining their usability for large-scale analyses. Moreover, a systematic approach for evaluating synthetic data quality and a benchmark synthetic dataset for developing and evaluating methods for polygenic risk scores are lacking.

Results: We present HAPNEST, a novel approach for efficiently generating diverse individual-level genotypic and phenotypic data. In comparison to alternative methods, HAPNEST shows faster computational speed and a lower degree of relatedness with reference panels, while generating datasets that preserve key statistical properties of real data. These desirable synthetic data properties enabled us to generate 6.8 million common variants and nine phenotypes with varying degrees of heritability and polygenicity across 1 million individuals. We demonstrate how HAPNEST can facilitate biobank-scale analyses by comparing seven methods for generating polygenic risk scores across multiple ancestry groups and different genetic architectures.

Availability and implementation: A synthetic dataset of 1 008 000 individuals and nine traits for 6.8 million common variants is available at https://www.ebi.ac.uk/biostudies/studies/S-BSST936. The HAPNEST software for generating synthetic datasets is available as Docker/Singularity containers and as open-source Julia and C code at https://github.com/intervene-EU-H2020/synthetic_data.

Conflict of interest statement

None declared.

Figures

Figure 1.
(a) A reference set of real haplotypes, from which segments (coloured) are imperfectly copied to construct a synthetic haplotype. (b) Detailed view of an individual segment. The segment length and the coalescence time, T, are sampled from a stochastic model. The variant at position i is copied only if T < m_i, where m_i is the variant's age of mutation. Variants that are not copied are shown in red. (c) Synthetic genotypes, g, are constructed as pairs of synthetic haplotypes, h_j, j ∈ {1, 2}. (d) Once the genotype is generated, a phenotype liability is assigned to each sample as the sum of the genetic effect, the covariate effect (if any) and environmental noise.
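The steps in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the HAPNEST implementation (which is written in Julia and C). The function names, the exponential distributions for segment length and coalescence time, and the direction of the age comparison (a variant is kept when it is older than the coalescence time) are all assumptions made for illustration.

```python
import random


def synth_haplotype(reference, ages, rng, mean_len=4.0, mean_time=1.0):
    """Build one synthetic haplotype as a mosaic of reference segments.

    reference: list of haplotypes (lists of 0/1 alleles), all the same length
    ages:      ages[i] is the age of mutation of the variant at position i
    mean_len, mean_time: hypothetical parameters of the stochastic model
    """
    n_sites = len(reference[0])
    hap = []
    while len(hap) < n_sites:
        source = rng.choice(reference)                       # haplotype to copy from
        seg_len = 1 + int(rng.expovariate(1.0 / mean_len))   # segment length (assumed exponential)
        coal_time = rng.expovariate(1.0 / mean_time)         # coalescence time T (assumed exponential)
        start = len(hap)
        for i in range(start, min(start + seg_len, n_sites)):
            # Assumption: copy the variant only if it is older than T.
            hap.append(source[i] if coal_time < ages[i] else 0)
    return hap


def genotype(h1, h2):
    """A genotype is the per-site allele count over a pair of haplotypes."""
    return [a + b for a, b in zip(h1, h2)]


def liability(g, betas, rng, covariate_effect=0.0, noise_sd=1.0):
    """Liability = genetic effect + covariate effect + environmental noise."""
    genetic = sum(b * x for b, x in zip(betas, g))
    return genetic + covariate_effect + rng.gauss(0.0, noise_sd)
```

Calling `synth_haplotype` twice and passing the results to `genotype` gives one synthetic individual; `liability` then assigns that individual a phenotype value from hypothetical per-variant effect sizes `betas`.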
Figure 2.
(a) LD correlation for 500 contiguous SNPs selected at random from chromosome 21 HapMap3 variants, for the European-ancestry reference dataset (N_ref = 775); (b) comparison of LD decay (Laido et al., 2014) for N_syn = 1000 European-ancestry synthetic samples; (c) comparison of LD correlation (for the same 500 SNPs shown in the reference panel) for N_syn = 1000 European-ancestry synthetic samples. We selected alleles using a MAF threshold of 0.001 and used plink with the --r2 square flag to compute the LD correlation matrix.
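For readers unfamiliar with the LD correlation matrix mentioned in the Figure 2 caption, the sketch below computes a square matrix of squared Pearson correlations between per-SNP allele counts. It is a small illustration of the quantity, not a reimplementation of plink; the function names and the toy input layout (one row per sample, one column per SNP) are assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (NaN if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def ld_r2_matrix(genotypes):
    """Square matrix of r^2 values; genotypes[s][i] is the allele count (0, 1, 2)
    of sample s at SNP i."""
    n_snps = len(genotypes[0])
    cols = [[row[i] for row in genotypes] for i in range(n_snps)]
    return [[pearson(cols[i], cols[j]) ** 2 for j in range(n_snps)]
            for i in range(n_snps)]
```

Comparing such a matrix for the reference panel with the one for the synthetic samples, entry by entry over the same SNPs, is one way to read panels (a) and (c).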
Figure 3.
(a) PCA projection plot for N_syn = 10 002 synthetic samples generated by the HAPNEST method (multiobjective ABC), for chromosome 21 HapMap3 variants, N_ref = 4062; (b) comparison of PCA projection plots and bivariate densities for N_syn = 1000 European-ancestry synthetic samples (N_ref = 775). The highest PC alignment scores for preservation of population structure are 0.311 (HAPGEN2), 0.281 (HAPNEST LD objective), 0.222 (G2P), 0.182 (HAPNEST multiobjective), and 0.043 (Sim1000G).
Figure 4.
Simulation times for genotype datasets for HAPNEST and HAPGEN2 (other methods are excluded from this comparison due to scalability and compatibility issues), averaged for five trials with 95% confidence intervals plotted, for a varying number of synthetic samples, SNPs and computing threads. Missing results are due to an experiment being terminated for exceeding the memory limit.
Figure 5.
PRS results for two genetic architectures, averaged across three experiment trials with error bars showing the range of outcomes, for HapMap3 variants across 22 chromosomes. (a) Pearson correlation between predicted and observed values, for various PRS methods and a European-ancestry phenotype with heritability 0.1 and polygenicity 0.005. (b) Pearson correlation for various target ancestry groups for the best-performing PRS method (MegaPRS) for the heritability 0.1 and polygenicity 0.005 phenotype. (c) Pearson correlation between predicted and observed values, for various PRS methods and a European-ancestry phenotype with heritability 0.5 and polygenicity 0.0001. (d) Pearson correlation for various target ancestry groups for the best-performing PRS method (PRScs) for the heritability 0.5 and polygenicity 0.0001 phenotype.
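The evaluation metric in Figure 5, the Pearson correlation between predicted and observed phenotype values, can be sketched as follows. The score here is the generic weighted sum of allele counts; the per-variant weights would come from whichever PRS method is being evaluated. The function names and inputs are hypothetical and do not reflect the interface of any of the compared tools.

```python
from math import sqrt


def polygenic_score(genotype, weights):
    """Predicted value for one individual: weighted sum of allele counts."""
    return sum(w * g for w, g in zip(weights, genotype))


def pearson_r(pred, obs):
    """Pearson correlation between predicted and observed values.
    Assumes neither sequence is constant."""
    n = len(pred)
    mp, mo = sum(pred) / n, sum(obs) / n
    cov = sum((p - mp) * (o - mo) for p, o in zip(pred, obs))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_o = sum((o - mo) ** 2 for o in obs)
    return cov / sqrt(var_p * var_o)


def evaluate_prs(genotypes, weights, observed):
    """Score every individual, then correlate the scores with the observed phenotypes."""
    scores = [polygenic_score(g, weights) for g in genotypes]
    return pearson_r(scores, observed)
```

Running `evaluate_prs` separately on the individuals of each target ancestry group gives per-group values of the kind compared in panels (b) and (d).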
