HAPNEST: efficient, large-scale generation and evaluation of synthetic datasets for genotypes and phenotypes

Sophie Wharrie¹, Zhiyu Yang², Vishnu Raj¹, Remo Monti³, Rahul Gupta⁴, Ying Wang⁴, Alicia Martin⁴, Luke J O'Connor⁴, Samuel Kaski¹,⁵, Pekka Marttinen¹, Pier Francesco Palamara⁶, Christoph Lippert³,⁷, Andrea Ganna²,⁴

Affiliations

¹ Department of Computer Science, Aalto University, Espoo 02150, Finland.
² Institute for Molecular Medicine Finland, FIMM, HiLIFE, University of Helsinki, Helsinki 00014, Finland.
³ Hasso Plattner Institute, University of Potsdam, Digital Engineering Faculty, Potsdam 14469, Germany.
⁴ Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, United States.
⁵ Department of Computer Science, University of Manchester, Manchester M13 9PL, United Kingdom.
⁶ Department of Statistics, University of Oxford, Oxford OX1 2JD, United Kingdom.
⁷ Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, New York, New York 10065, United States.

PMID: 37647640
PMCID: PMC10493177
DOI: 10.1093/bioinformatics/btad535

Sophie Wharrie et al. Bioinformatics. 2023 Sep 2;39(9):btad535.

Abstract

Motivation: Existing methods for simulating synthetic genotype and phenotype datasets have limited scalability, constraining their usability for large-scale analyses. Moreover, a systematic approach for evaluating synthetic data quality and a benchmark synthetic dataset for developing and evaluating methods for polygenic risk scores are lacking.

Results: We present HAPNEST, a novel approach for efficiently generating diverse individual-level genotypic and phenotypic data. In comparison to alternative methods, HAPNEST shows faster computational speed and a lower degree of relatedness with reference panels, while generating datasets that preserve key statistical properties of real data. These desirable synthetic data properties enabled us to generate 6.8 million common variants and nine phenotypes with varying degrees of heritability and polygenicity across 1 million individuals. We demonstrate how HAPNEST can facilitate biobank-scale analyses by comparing seven polygenic risk scoring methods across multiple ancestry groups and different genetic architectures.

Availability and implementation: A synthetic dataset of 1 008 000 individuals and nine traits for 6.8 million common variants is available at https://www.ebi.ac.uk/biostudies/studies/S-BSST936. The HAPNEST software for generating synthetic datasets is available as Docker/Singularity containers and open source Julia and C code at https://github.com/intervene-EU-H2020/synthetic_data.

Conflict of interest statement

None declared.

Figures

**Figure 1.**
(a) A reference set of real haplotypes, from which segments (coloured) are imperfectly copied to construct a synthetic haplotype. (b) Detailed view of an individual segment. The segment length, $\ell$, and coalescence time, $T$, are sampled from a stochastic model. A genetic variant at position $i$ is copied only if $T \leq m_{i}$, where $m_{i}$ is the variant's age of mutation. Variants that are not copied are shown in red. (c) Synthetic genotypes, $g$, are constructed as pairs of synthetic haplotypes, $h_{j}$, $j \in \{1, 2\}$. (d) Once the genotype is generated, a phenotype liability is assigned to each sample as the sum of the genetic effect, the covariate effect (if any) and environmental noise.
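
The copying scheme in panels (a) to (d) can be sketched as follows. This is a minimal illustration, not the HAPNEST implementation: the exponential distributions for segment length and coalescence time, the uniform choice of source haplotype, and all function and parameter names (`synth_haplotype`, `synth_genotype`, `liability`, `mean_len`, `mean_T`, `noise_sd`) are assumptions made for the example. The caption itself only states that $\ell$ and $T$ are sampled from a stochastic model.

```python
import random


def synth_haplotype(ref_haps, mutation_ages, mean_len, mean_T, rng):
    """Build one synthetic haplotype as an imperfect mosaic of reference haplotypes.

    ref_haps: list of equal-length 0/1 lists (reference haplotypes).
    mutation_ages: age of mutation m_i for each variant position i.
    mean_len, mean_T: means of the placeholder exponential distributions
        (an assumption; the real stochastic model is not given here).
    """
    n_sites = len(mutation_ages)
    hap = [0] * n_sites
    start = 0
    while start < n_sites:
        # Sample the segment length (in sites, at least 1) and coalescence time T.
        seg_len = max(1, round(rng.expovariate(1.0 / mean_len)))
        T = rng.expovariate(1.0 / mean_T)
        source = rng.choice(ref_haps)  # assumed: uniform choice of source haplotype
        end = min(start + seg_len, n_sites)
        for i in range(start, end):
            # Copy the variant only if T <= m_i, as in panel (b).
            if source[i] == 1 and T <= mutation_ages[i]:
                hap[i] = 1
        start = end
    return hap


def synth_genotype(ref_haps, mutation_ages, mean_len, mean_T, rng):
    """Pair two synthetic haplotypes into allele counts in {0, 1, 2}, as in panel (c)."""
    h1 = synth_haplotype(ref_haps, mutation_ages, mean_len, mean_T, rng)
    h2 = synth_haplotype(ref_haps, mutation_ages, mean_len, mean_T, rng)
    return [a + b for a, b in zip(h1, h2)]


def liability(genotype, effects, covariate_effect, noise_sd, rng):
    """Liability = genetic effect + covariate effect + environmental noise, as in panel (d)."""
    genetic = sum(beta * g for beta, g in zip(effects, genotype))
    return genetic + covariate_effect + rng.gauss(0.0, noise_sd)
```

Passing a seeded `random.Random` instance as `rng` keeps the sketch reproducible.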

**Figure 2.**
(a) LD correlation for 500 contiguous SNPs selected at random from chromosome 21 HapMap3 variants, for the European-ancestry reference dataset ($N_{ref} = 775$); (b) comparison of LD decay (Laido et al., 2014) for $N_{syn} = 1000$ European-ancestry synthetic samples; (c) comparison of LD correlation (for the same 500 SNPs shown for the reference panel) for $N_{syn} = 1000$ European-ancestry synthetic samples. We selected alleles with MAF $\geq 0.001$ and used PLINK with the --r2 square flag to compute the LD correlation matrix.
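
The quantity plotted in panels (a) and (c) can be illustrated with a small sketch that filters variants by minor allele frequency and then computes a square matrix of pairwise $r^2$ values. It assumes unphased genotypes coded as 0/1/2 allele counts and uses the squared Pearson correlation of those counts as the $r^2$ estimator. It is not a reimplementation of PLINK, and the function names (`maf`, `r2`, `ld_matrix`) are hypothetical.

```python
def maf(column):
    """Minor allele frequency of one SNP from 0/1/2 allele counts."""
    freq = sum(column) / (2 * len(column))
    return min(freq, 1 - freq)


def r2(x, y):
    """Squared Pearson correlation between two SNPs' allele counts."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined for a monomorphic SNP
    return (sxy * sxy) / (sxx * syy)


def ld_matrix(genotypes, min_maf=0.001):
    """Return (kept SNP indices, square r^2 matrix) for a samples-by-SNPs genotype list."""
    n_snps = len(genotypes[0])
    # Transpose to one list of allele counts per SNP.
    columns = [[row[j] for row in genotypes] for j in range(n_snps)]
    # Keep only SNPs passing the MAF threshold used in the caption.
    kept = [j for j, col in enumerate(columns) if maf(col) >= min_maf]
    matrix = [[r2(columns[a], columns[b]) for b in kept] for a in kept]
    return kept, matrix
```

For real datasets of the size discussed here, a dedicated tool such as PLINK is the practical choice; the sketch only makes the definition concrete.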

**Figure 3.**
(a) PCA projection plot for $N_{syn} = 10\,002$ synthetic samples generated by the HAPNEST method (multiobjective ABC), for chromosome 21 HapMap3 variants, $N_{ref} = 4062$; (b) comparison of PCA projection plots and bivariate densities for $N_{syn} = 1000$ European-ancestry synthetic samples ($N_{ref} = 775$). The highest PC alignment score for preservation of population structure is 0.311 for HAPGEN2, 0.281 (HAPNEST LD objective), 0.222 (G2P), 0.182 (HAPNEST multiobjective), and 0.043 (Sim1000G).

**Figure 4.**
Simulation times for genotype datasets for HAPNEST and HAPGEN2 (other methods are excluded from this comparison due to scalability and compatibility issues), averaged for five trials with 95% confidence intervals plotted, for a varying number of synthetic samples, SNPs and computing threads. Missing results are due to an experiment being terminated for exceeding the memory limit.

**Figure 5.**
PRS results for two genetic architectures, averaged across three experiment trials with error bars showing the range of outcomes, for HapMap3 variants across 22 chromosomes. (a) Pearson correlation between predicted and observed values, for various PRS methods and a European-ancestry phenotype with heritability 0.1 and polygenicity 0.005. (b) Pearson correlation for various target ancestry groups for the best-performing PRS method (MegaPRS) for the heritability 0.1 and polygenicity 0.005 phenotype. (c) Pearson correlation between predicted and observed values, for various PRS methods and a European-ancestry phenotype with heritability 0.5 and polygenicity 0.0001. (d) Pearson correlation for various target ancestry groups for the best-performing PRS method (PRScs) for the heritability 0.5 and polygenicity 0.0001 phenotype.
