Efficient storage and regression computation for population-scale genome sequencing studies

doi:10.1101/2024.04.11.589062

This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2025 Jan 23:2024.04.11.589062.

Manuel A Rivas¹, Christopher Chang²

PMID: 38659813
PMCID: PMC11042230
DOI: 10.1101/2024.04.11.589062

Manuel A Rivas et al. bioRxiv. 2025.

Affiliations

¹ Department of Biomedical Data Science, Stanford University, Stanford, CA, USA 94305.
² Grail, Inc, Menlo Park, CA.

Update in

Efficient storage and regression computation for population-scale genome sequencing studies.
Rivas MA, Chang C. Bioinformatics. 2025 Mar 4;41(3):btaf067. doi: 10.1093/bioinformatics/btaf067. PMID: 39932865.

Abstract

In the era of big data in human genetics, large-scale biobanks aggregating genetic data from diverse populations have become important for advancing our understanding of human health and disease. However, the computational and storage demands of whole genome sequencing (WGS) studies pose significant challenges, especially for researchers from underfunded institutions or developing countries, creating a disparity in research capabilities. We introduce new approaches that enhance computational efficiency and reduce data storage requirements for WGS studies. By developing algorithms for compressed storage of genetic data, with a particular focus on optimizing the representation of rare variants, and by designing regression methods tailored to the scale and complexity of WGS data, we substantially lower computational and storage costs. We integrate our approach into PLINK 2.0. The implementation demonstrates considerable reductions in storage space and computational time without compromising analytical accuracy, as evidenced by its application to the All of Us project data. We optimized the runtime of an exome-wide association analysis involving 19.4 million variants and the body mass index phenotype of 125,077 individuals, reducing it from 695.35 minutes (approximately 11.6 hours) on a single machine to just 1.57 minutes using 30 GB of memory and 50 threads (or 8.67 minutes with 4 threads). Additionally, we extended this approach to support multi-phenotype analyses. We anticipate that our approach will enable researchers across the globe to unlock the potential of population biobanks, accelerating the pace of discoveries that can improve our understanding of human health and disease.
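The idea of a sparse representation for rare variants can be illustrated with a minimal sketch. This is not the PGEN on-disk format and not PLINK 2.0 code; the function names `to_sparse` and `to_dense` and the toy data are assumptions made for illustration. The sketch shows only the general principle: when almost every sample carries zero copies of the alternate allele, storing just the carriers takes far fewer entries than storing one value per sample.

```python
# Illustrative only: a toy sparse encoding, NOT the PGEN file format.

def to_sparse(genotypes):
    """Keep only non-zero alternate-allele counts, keyed by sample index."""
    carriers = {i: g for i, g in enumerate(genotypes) if g != 0}
    return len(genotypes), carriers


def to_dense(n_samples, carriers):
    """Rebuild the full per-sample genotype list from the sparse form."""
    dense = [0] * n_samples
    for i, g in carriers.items():
        dense[i] = g
    return dense


# Hypothetical rare variant: 1,000 samples, two carriers.
dense = [0] * 1000
dense[17] = 1    # heterozygous carrier
dense[402] = 2   # homozygous carrier

n_samples, carriers = to_sparse(dense)
stored_entries = len(carriers)   # entries kept in the sparse form
```

In this toy case the sparse form keeps 2 entries instead of 1,000, and `to_dense` recovers the original list exactly, so no genotype information is lost.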

Figures

**FIG. 1. Comparison of storage allocated to exome sequencing data in the All of Us project.**
Storage required to represent All of Us exome sequencing genetic variant data for a range of file types. Each bar denotes the storage size, with the value shown above the bar. The file types are PLINK binary BED, Hail splitMT and Hail multiMT, VCF, BGEN, and PLINK 2.0 PGEN with sparse variant representation.

**FIG. 2. Comparison of regression computational time.**
Computation time required to run regression on the All of Us body mass index (BMI) phenotype using 18 covariates (age, sex, and 16 principal components of the genetic data) across 19.4 million genetic variants in 125,077 individuals, for different computation scenarios. Each bar displays the time in minutes, with the value shown above the bar. The comparison highlights the efficiency gains achieved with PLINK 2.0 through optimized computation methods and sparse genotype data representation for rare variants.

**FIG. 3. P-value comparison for quantitative and binary data.**
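To illustrate why a sparse genotype representation can speed up per-variant regression, the following sketch computes a simple linear-regression slope by touching only the carriers of a variant. It is a simplified illustration under stated assumptions, not the method implemented in PLINK 2.0: the only covariate here is an intercept (handled by centering the phenotype once), whereas the analysis in Fig. 2 adjusts for 18 covariates. The names `center`, `sparse_slope`, and `dense_slope` and the toy data are hypothetical.

```python
# Illustrative only: intercept-only linear regression with a sparse genotype.
# With a centered phenotype y_c (sum of y_c is zero):
#   Sxy = sum_i g_i * y_c[i]             (only carriers contribute)
#   Sxx = sum_i g_i^2 - (sum_i g_i)^2/n  (only carriers contribute)

def center(y):
    """Subtract the mean once; reused for every variant."""
    mean = sum(y) / len(y)
    return [v - mean for v in y]


def sparse_slope(n_samples, carriers, y_centered):
    """Slope of y on genotype, visiting only non-zero genotypes."""
    sxy = sum(g * y_centered[i] for i, g in carriers.items())
    sg = sum(carriers.values())
    sgg = sum(g * g for g in carriers.values())
    sxx = sgg - sg * sg / n_samples
    return sxy / sxx


def dense_slope(genotypes, y):
    """Reference implementation that visits every sample."""
    n = len(y)
    gbar = sum(genotypes) / n
    ybar = sum(y) / n
    sxy = sum((g - gbar) * (v - ybar) for g, v in zip(genotypes, y))
    sxx = sum((g - gbar) ** 2 for g in genotypes)
    return sxy / sxx


# Hypothetical data: 1,000 samples, two carriers of a rare variant.
y = [20.0 + 0.01 * i for i in range(1000)]   # made-up BMI-like values
genotypes = [0] * 1000
genotypes[17] = 1
genotypes[402] = 2
carriers = {17: 1, 402: 2}

beta_sparse = sparse_slope(len(y), carriers, center(y))
beta_dense = dense_slope(genotypes, y)
```

Both functions return the same slope up to floating-point error, but the sparse version does work proportional to the number of carriers rather than the number of samples. Extending this to additional covariates would require residualizing the phenotype and genotype on those covariates, which this sketch does not attempt.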

Grants and funding

R01 HG010140/HG/NHGRI NIH HHS/United States
