HAPNEST: efficient, large-scale generation and evaluation of synthetic datasets for genotypes and phenotypes
- PMID: 37647640
- PMCID: PMC10493177
- DOI: 10.1093/bioinformatics/btad535
Abstract
Motivation: Existing methods for simulating synthetic genotype and phenotype datasets have limited scalability, constraining their usability for large-scale analyses. Moreover, a systematic approach for evaluating synthetic data quality and a benchmark synthetic dataset for developing and evaluating methods for polygenic risk scores are lacking.
Results: We present HAPNEST, a novel approach for efficiently generating diverse individual-level genotypic and phenotypic data. Compared with alternative methods, HAPNEST runs faster and produces samples that are less closely related to the reference panels, while generating datasets that preserve key statistical properties of real data. These properties enabled us to generate a dataset of 6.8 million common variants and nine phenotypes with varying degrees of heritability and polygenicity for 1 million individuals. We demonstrate how HAPNEST can facilitate biobank-scale analyses by comparing seven methods for generating polygenic risk scores across multiple ancestry groups and different genetic architectures.
Availability and implementation: A synthetic dataset of 1 008 000 individuals and nine traits for 6.8 million common variants is available at https://www.ebi.ac.uk/biostudies/studies/S-BSST936. The HAPNEST software for generating synthetic datasets is available as Docker/Singularity containers and as open-source Julia and C code at https://github.com/intervene-EU-H2020/synthetic_data.
© The Author(s) 2023. Published by Oxford University Press.
Conflict of interest statement
None declared.