[Preprint]. 2025 Jan 23:2024.04.11.589062.
doi: 10.1101/2024.04.11.589062.

Efficient storage and regression computation for population-scale genome sequencing studies


Manuel A Rivas et al. bioRxiv.

Abstract

In the era of big data in human genetics, large-scale biobanks aggregating genetic data from diverse populations have become important for advancing our understanding of human health and disease. However, the computational and storage demands of whole genome sequencing (WGS) studies pose significant challenges, especially for researchers from underfunded institutions or developing countries, creating a disparity in research capabilities. We introduce new approaches that significantly enhance computational efficiency and reduce data storage requirements for WGS studies. By developing algorithms for compressed storage of genetic data, focusing particularly on optimizing the representation of rare variants, and designing regression methods tailored to the scale and complexity of WGS data, we significantly lower computational and storage costs. We integrate our approach into PLINK 2.0. The implementation demonstrates considerable reductions in storage space and computational time without compromising analytical accuracy, as evidenced by its application to the All of Us project data. We optimized the runtime of an exome-wide association analysis involving 19.4 million variants and the body mass index phenotype of 125,077 individuals, reducing it from 695.35 minutes (approximately 11.6 hours) on a single machine to just 1.57 minutes using 30 GB of memory and 50 threads (or 8.67 minutes with 4 threads). Additionally, we extended this approach to support multi-phenotype analyses. We anticipate that our approach will enable researchers across the globe to unlock the potential of population biobanks, accelerating the pace of discoveries that can improve our understanding of human health and disease.
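To make the idea of a sparse representation for rare variants concrete, the following is a minimal illustrative sketch in Python. It is not the PLINK 2.0 PGEN format or the authors' algorithm; the function names (`to_sparse`, `sparse_dot`, `dense_dot`) and the toy data are assumptions introduced here. It stores only the carriers of a rare variant and shows that a genotype-phenotype inner product, a basic building block of regression, can then be computed by visiting carriers alone.

```python
from typing import List, Tuple


def to_sparse(genotypes: List[int]) -> Tuple[List[int], List[int]]:
    """Keep only (sample index, alt-allele count) for non-reference samples.

    Hypothetical helper: for a rare variant almost every entry is 0,
    so the two short lists are far smaller than the dense vector.
    """
    idx = [i for i, g in enumerate(genotypes) if g != 0]
    vals = [genotypes[i] for i in idx]
    return idx, vals


def sparse_dot(idx: List[int], vals: List[int], y: List[float]) -> float:
    """Inner product of genotype and phenotype, touching carriers only."""
    return sum(v * y[i] for i, v in zip(idx, vals))


def dense_dot(genotypes: List[int], y: List[float]) -> float:
    """Reference computation over every sample, for comparison."""
    return sum(g * yi for g, yi in zip(genotypes, y))


# Toy data (assumed): 1,000 samples, two carriers of a rare variant.
genotypes = [0] * 1000
genotypes[17] = 1    # heterozygous carrier
genotypes[503] = 2   # homozygous alternate carrier
phenotype = [float(i % 7) for i in range(1000)]

idx, vals = to_sparse(genotypes)
sparse_result = sparse_dot(idx, vals, phenotype)
dense_result = dense_dot(genotypes, phenotype)
```

In this toy case the sparse form holds 2 index/value pairs instead of 1,000 entries, and the inner product needs 2 multiplications instead of 1,000. The saving grows as the variant becomes rarer relative to the cohort size.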

Figures

FIG. 1: Comparison of storage allocated to exome sequencing data in the All of Us project.
Storage required to represent All of Us exome sequencing genetic variant data for a range of file types. Each bar denotes the storage size, with the value shown on top. File types include the PLINK binary BED file, Hail splitMT and Hail multiMT files, the VCF file, BGEN, and the PLINK 2.0 PGEN file with sparse variant representation.
FIG. 2: Comparison of regression computational time.
Computation time required to run regression on the All of Us body mass index (BMI) phenotype using 18 covariates (age, sex, and 16 principal components of the genetic data), across 19.4 million genetic variants in 125,077 individuals, for different computation scenarios. Each bar displays the time in minutes, with the value shown on top. We highlight the significant efficiency gains achieved through optimized computation methods and sparse genotype data representation for rare variants with PLINK 2.0.
FIG. 3: P-value comparison for quantitative and binary data.

