Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
[Preprint]. 2024 Aug 21:2024.04.23.590800.
doi: 10.1101/2024.04.23.590800.

Genotype Representation Graphs: Enabling Efficient Analysis of Biobank-Scale Data

Affiliations

Genotype Representation Graphs: Enabling Efficient Analysis of Biobank-Scale Data

Drew DeHaas et al. bioRxiv. .

Update in

Abstract

Computational analysis of a large number of genomes requires a data structure that can represent the dataset compactly while also enabling efficient operations on variants and samples. Current practice is to store large-scale genetic polymorphism data using tabular data structures and file formats, where rows and columns represent samples and genetic variants. However, encoding genetic data in such formats has become unsustainable. For example, the UK Biobank polymorphism data of 200,000 phased whole genomes has exceeded 350 terabytes (TB) in Variant Call Format (VCF), cumbersome and inefficient to work with. To mitigate the computational burden, we introduce the Genotype Representation Graph (GRG), an extremely compact data structure to losslessly present phased whole-genome polymorphisms. A GRG is a fully connected hierarchical graph that exploits variant-sharing across samples, leveraging ideas inspired by Ancestral Recombination Graphs. Capturing variant-sharing in a multitree structure compresses biobank-scale human data to the point where it can fit in a typical server's RAM (5-26 gigabytes (GB) per chromosome), and enables graph-traversal algorithms to trivially reuse computed values, both of which can significantly reduce computation time. We have developed a command-line tool and a library usable via both C++ and Python for constructing and processing GRG files which scales to a million whole genomes. It takes 160GB disk space to encode the information in 200,000 UK Biobank phased whole genomes as a GRG, more than 13 times smaller than the size of compressed VCF. We show that summaries of genetic variants such as allele frequency and association effect can be computed on GRG via graph traversal that runs significantly faster than all tested alternatives, including vcf.gz, PLINK BED, tree sequence, XSI, and Savvy. Furthermore, GRG is particularly suitable for doing repeated calculations and interactive data analysis. We anticipate that GRG-based algorithms will improve the scalability of various types of computation and generally lower the cost of analyzing large genomic datasets.

PubMed Disclaimer

Figures

Figure 1:
Figure 1:. Data encodings for phased polymorphisms.
Phased diploid genotypes in three different formats. a: A phased VCF file. Rows are genetic polymorphisms, columns are individual genotypes. The reference allele is labeled as 0 and alternative alleles are non-zero. A vertical bar separates an individual’s genotype into two haplotypes. b: Encoding the data as a matrix. This is essentially a rotation of the VCF data, with columns representing polymorphisms and rows representing haplotypes. c: A GRG representation of the same data. Each node represents a unique subset of samples, and each mutation is mapped to one node. The node IDs are labeled inside nodes, and the mutation IDs are marked next to the mutation nodes. Samples are reachable from a mutation node iff the mutation is present in the sample haplotype. This GRG can be described as the tables in the lower panel c.
Figure 2:
Figure 2:. Overview and performance of the GRG construction algorithm.
a: GRG construction includes four high-level steps: (1) segmenting the genome, (2) building an initial tGRG for each segment, (3) mapping mutations from each segment to the corresponding tGRG, and (4) merging all tGRGs into a single GRG. Dashed lines represent edges/nodes newly created in the current step. b, c, and d summarize the performance of GRG construction against various sample sizes using simulated data of a 100 mega base pair (Mbp) region. b: The total construction time (in green) and the time breaking down into BuildShape (in yellow) and MapMutations (in blue) against the number of samples. c: Memory usage in terms of average RAM (in blue) and maximum RAM (in green) against the number of samples. d: Graph size of the resulting GRGs breaking down into the number of edges (in grey) and number of nodes (in teal) against the number of samples.
Figure 3:
Figure 3:. Cost, time, and file size for GRG construction.
a: Comparisons of file sizes across PLINK BED, vcf.gz, GRG, GRG(zstd), Savvy, and XSI against different numbers of samples, with a sequence length of 108 base pairs for all data points. b: Comparisons of GRG against vcf.gz, Savvy and XSI for storing the whole-genome polymorphisms in the UK Biobank. Each chromosome is represented by one data point. The performance of Savvy and XSI is tested with two chromosomes, 13 and 22. c-d: The construction time (elapsed time, not CPU time) (c) and computational cost (d) of GRG, XSI and Savvy with UK Biobank whole-genome sequencing data. The most economical cloud node was used for each method. A cloud node with up to 72 CPU threads is used for the construction of GRGs, while a cheaper node with up to 2 threads is used for constructing XSI and Savvy.
Figure 4:
Figure 4:. GRG accelerates genome-wide summary statistic computation.
a: Benchmarking GRG-based computation against other methods for genome-wide allele frequencies in simulated datasets. Simulated datasets with a sequence of 108 base pairs and variable sample sizes are used in this experiment. Standard tabular methods (vcf.gz and BED files) are analyzed by PLINK. XSI, Savvy and tree sequence files (tskit) are analyzed with functions from their respective libraries (see Supplementary). b-c: Comparison of computation time (b) and cloud compute cost (c) for allele frequencies with different data structures in the UK Biobank dataset. The most economical cloud node was used for each method. The left point of each line shows the result of chromosome 22 and the right point shows that of chromosome 13. d: Comparison of computation time for doing association analyses of a simulated 100 Mbp region or the dot product computation between genotypes and a phenotype vector in the simulated dataset, using matrix operations (PLINK and XSI) or graph traversals (GRG). e: Comparison of runtimes of the association effect computations between genotypes and a phenotype vector in the UK Biobank dataset, using XSI and GRG. Chromosomes 13 and 22 are shown. f: Runtimes for dot-product using GRG with varying numbers of phenotypes. Solid lines present the runtimes for loading GRGs into RAM once and running dot-product calculations with either one phenotype or ten phenotypes. The dash line is the expected time if the dot-product is calculated independently ten times.
Figure 5:
Figure 5:. The biological interpretation of the GRG data structure.
a: The tree above a GRG node can originate from the pedigree inheritance. The left panel shows the full pedigree of a focal haplotype, including the ancestral inheritance and recombination events. The right panel shows the specific haplotypes in the pedigree that are passed down to the focal haplotype, and the paths to the focal haplotype form a tree shape. Here, recombination events cause the merging of lineages. b: GRG as a reduction of ARG. The reference and sample haplotypes are listed. Here, we highlight the genealogies in ARG that remain in the corresponding GRG. When present in an ARG, the following redundant structures are removed to become a GRG: 1) edges and nodes that represent the same topological structure with different coalescence time (e.g., one of the two light gray edges in the ARG is removed); 2) edges without mutations (e.g., the upper green edge in the ARG is removed); 3) mutations that are inherited by the same set of haplotypes through different paths are merged (e.g., m2 and m3).
Figure 6:
Figure 6:. Schematic illustrations of graph-based algorithms for computing allele frequencies and association effects.
a: Computation of allele frequencies using standard tabular methods. Genotypes of each variant are read row by row from disk and allele frequencies are computed independently for each variant in the tabular structure. b: Computation of allele frequencies using GRG. In a GRG, the sample frequencies of nodes are firstly computed and then mapped to the mutations based on the node-mutation mapping. The frequency of leaf nodes is set as 1/n (e.g., node0–3), where n is the number of haplotypes in the population. For other nodes (e.g., internal nodes and root nodes), frequencies are computed sequentially based on the node IDs (e.g., from node4 to node7) as the sum of frequencies of each node’s children nodes (e.g., node4 = node0 + node1). c: Computing graph traversal. The sample count is the count of haplotypes beneath a node. The homozygote count can be propagated using the association effect (β) with genotype and phenotype vectors. d: A schematic illustration of computing association effect through GRG.numIndividualCoals via graph traversal. The computation of XTY is similar to the sample count, except that the initialized values for leaf nodes (labeled inside the leaf nodes) are the phenotypic values of the sample nodes at the individual level. The term XTX is computed as samplecount+2×homozygotecount.

References

    1. Abdellaoui Abdel, Yengo Loic, Verweij Karin J.H., and Visscher Peter M.. 2023. “15 Years of GWAS Discovery: Realizing the Promise.” The American Journal of Human Genetics 110 (2): 179–94. 10.1016/j.ajhg.2022.12.011. - DOI - PMC - PubMed
    1. Band Gavin, and Marchini Jonathan. 2018. “BGEN: A Binary File Format for Imputed Genotype and Haplotype Data.” Preprint. Bioinformatics. 10.1101/308296. - DOI
    1. Baumdicker Franz, Bisschop Gertjan, Goldstein Daniel, Gower Graham, Ragsdale Aaron P, Tsambos Georgia, Zhu Sha, et al. 2022. “Efficient Ancestry and Mutation Simulation with Msprime 1.0.” Edited by Browning S. Genetics 220 (3): iyab229. 10.1093/genetics/iyab229. - DOI - PMC - PubMed
    1. Bloom Burton H. 1970. “Space/Time Trade-Offs in Hash Coding with Allowable Errors.” Communications of the ACM 13 (7): 422–26. 10.1145/362686.362692. - DOI
    1. Brandes Ulrik, Eiglsperger Markus, Herman Ivan, Himsolt Michael, and Marshall M. Scott. 2002. “GraphML Progress Report Structural Layer Proposal.” In Graph Drawing, edited by Mutzel Petra, Jünger Michael, and Leipert Sebastian, 2265:501–12. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer Berlin Heidelberg. 10.1007/3-540-45848-4_59. - DOI

Publication types

LinkOut - more resources