This is a preprint.
Genotype Representation Graphs: Enabling Efficient Analysis of Biobank-Scale Data
- PMID: 38712040
- PMCID: PMC11071416
- DOI: 10.1101/2024.04.23.590800
Genotype Representation Graphs: Enabling Efficient Analysis of Biobank-Scale Data
Update in
-
Enabling efficient analysis of biobank-scale data with genotype representation graphs.Nat Comput Sci. 2025 Feb;5(2):112-124. doi: 10.1038/s43588-024-00739-9. Epub 2024 Dec 5. Nat Comput Sci. 2025. PMID: 39639156 Free PMC article.
Abstract
Computational analysis of a large number of genomes requires a data structure that can represent the dataset compactly while also enabling efficient operations on variants and samples. Current practice is to store large-scale genetic polymorphism data using tabular data structures and file formats, where rows and columns represent samples and genetic variants. However, encoding genetic data in such formats has become unsustainable. For example, the UK Biobank polymorphism data of 200,000 phased whole genomes has exceeded 350 terabytes (TB) in Variant Call Format (VCF), cumbersome and inefficient to work with. To mitigate the computational burden, we introduce the Genotype Representation Graph (GRG), an extremely compact data structure to losslessly present phased whole-genome polymorphisms. A GRG is a fully connected hierarchical graph that exploits variant-sharing across samples, leveraging ideas inspired by Ancestral Recombination Graphs. Capturing variant-sharing in a multitree structure compresses biobank-scale human data to the point where it can fit in a typical server's RAM (5-26 gigabytes (GB) per chromosome), and enables graph-traversal algorithms to trivially reuse computed values, both of which can significantly reduce computation time. We have developed a command-line tool and a library usable via both C++ and Python for constructing and processing GRG files which scales to a million whole genomes. It takes 160GB disk space to encode the information in 200,000 UK Biobank phased whole genomes as a GRG, more than 13 times smaller than the size of compressed VCF. We show that summaries of genetic variants such as allele frequency and association effect can be computed on GRG via graph traversal that runs significantly faster than all tested alternatives, including vcf.gz, PLINK BED, tree sequence, XSI, and Savvy. Furthermore, GRG is particularly suitable for doing repeated calculations and interactive data analysis. We anticipate that GRG-based algorithms will improve the scalability of various types of computation and generally lower the cost of analyzing large genomic datasets.
Figures






References
-
- Band Gavin, and Marchini Jonathan. 2018. “BGEN: A Binary File Format for Imputed Genotype and Haplotype Data.” Preprint. Bioinformatics. 10.1101/308296. - DOI
-
- Bloom Burton H. 1970. “Space/Time Trade-Offs in Hash Coding with Allowable Errors.” Communications of the ACM 13 (7): 422–26. 10.1145/362686.362692. - DOI
-
- Brandes Ulrik, Eiglsperger Markus, Herman Ivan, Himsolt Michael, and Marshall M. Scott. 2002. “GraphML Progress Report Structural Layer Proposal.” In Graph Drawing, edited by Mutzel Petra, Jünger Michael, and Leipert Sebastian, 2265:501–12. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer Berlin Heidelberg. 10.1007/3-540-45848-4_59. - DOI
Publication types
Grants and funding
LinkOut - more resources
Full Text Sources