Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
[Preprint]. 2024 Oct 29:2024.10.27.620212.
doi: 10.1101/2024.10.27.620212.

Integer programming framework for pangenome-based genome inference

Affiliations

Integer programming framework for pangenome-based genome inference

Ghanshyam Chandra et al. bioRxiv. .

Update in

Abstract

Affordable genotyping methods are essential in genomics. Commonly used genotyping methods primarily support single nucleotide variants and short indels but neglect structural variants. Additionally, accuracy of read alignments to a reference genome is unreliable in highly polymorphic and repetitive regions, further impacting genotyping performance. Recent works highlight the advantage of haplotype-resolved pangenome graphs in addressing these challenges. Building on these developments, we propose a rigorous alignment-free genotyping framework. Our formulation seeks a path through the pangenome graph that maximizes the matches between the path and substrings of sequencing reads (e.g., k-mers) while minimizing recombination events (haplotype switches) along the path. We prove that this problem is NP-Hard and develop efficient integer-programming solutions. We benchmarked the algorithm using downsampled short-read datasets from homozygous human cell lines with coverage ranging from 0.1× to 10×. Our algorithm accurately estimates complete major histocompatibility complex (MHC) haplotype sequences with small edit distances from the ground-truth sequences, providing a significant advantage over existing methods on low-coverage inputs. Although our algorithm is designed for haploid samples, we discuss future extensions to diploid samples.

PubMed Disclaimer

Figures

Fig. 1:
Fig. 1:
A small example of our reduction from Hamiltonian Path Problem to Problem 1 (Theorem 1). (Top) The starting instance of G of Hamiltonian Path Problem. (Bottom) The vertex labeled graph G constructed from G. Here, n=4 and we assume c=2, making b=log2(n+2(c(n+1)+1))+1=6. Each edge is supported by a unique haplotype (not shown). The string set is 𝓢={0000010000001,0000100000001,,0110100000001}.
Fig. 2:
Fig. 2:
Evaluation of the performance of the IQP method with and without relaxation of the binary edge variables xuv. We compared runtime using various short-read datasets.
Fig. 3:
Fig. 3:
Evaluation of the performance of the ILP method with and without relaxation of the binary edge variables xuv. We compared runtime using various short-read datasets.
Fig. 1:
Fig. 1:
A simple illustration of an haplotype-resolved pangenome graph with two haplotype paths highlighted in pink and blue colors. An inferred path with a single recombination is shown as a dashed line.
Fig. 2:
Fig. 2:
(A) A pangenome graph with four haplotype paths h1,h2,h3 and h4. Set of haplotype paths passing through a vertex is listed below each vertex. (B) The corresponding expanded graph which includes four disjoint paths, one for each haplotype path. The recombination edges are shown in purple, these edges have a weight of c. We consider only the useful recombinations (Lemma 3). The edges which are not recombination edges in the expanded graph have a weight of 0. (C) The corresponding optimized expanded graph.
Fig. 3:
Fig. 3:
Accuracy of haplotype sequences estimated by PHI, VG and PanGenie using short reads from MHC sequences of five haplotypes (APD, DBB, MANN, QBL, SSTO). The x-axes indicate the coverage of short-read data. The y-axes indicate the edit distance between the estimate haplotype sequence and the ground-truth sequence on a logarithmic scale.
Fig. 4:
Fig. 4:
Performance comparison between the ILP and IQP solutions implemented in PHI. We compared their runtime and memory-usage using short-read sequencing datasets sampled from five haplotypes.
Fig. 5:
Fig. 5:
Assessement of PHI’s performance with the increasing number of genomes in pangenome graph. The left figure shows the accuracy in terms of edit distance between the output sequences and ground-truth sequences. The middle and right figure show the runtime and memory-usage respectively.

References

    1. Baaijens J.A., Bonizzoni P., Boucher C., Della Vedova G., Pirola Y., Rizzi R., Sirén J.: Computational graph pangenomics: a tutorial on data structures and their applications. Natural Computing pp. 1–28 (2022) - PMC - PubMed
    1. Bradbury P.J., Casstevens T., Jensen S.E., Johnson L., Miller Z., Monier B., Romay M., Song B., Buckler E.S.: The practical haplotype graph, a platform for storing and using pangenomes for imputation. Bioinformatics 38(15), 3698–3702 (2022) - PMC - PubMed
    1. Chandra G., Gibney D., Jain C.: Haplotype-aware sequence alignment to pangenome graphs. Genome Research 34(9), 1265–1275 (2024) - PMC - PubMed
    1. Computational Pan-Genomics Consortium: Computational pan-genomics: status, promises and challenges. Briefings in bioinformatics 19(1), 118–135 (2018) - PMC - PubMed
    1. Davies R.W., Kucka M., Su D., et al. : Rapid genotype imputation from sequence with reference panels. Nature Genetics 53(7), 1104–1111 (Jun 2021). 10.1038/s41588-021-00877-0 - DOI - PMC - PubMed

Publication types