This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2024 Oct 29:2024.10.27.620212.

doi: 10.1101/2024.10.27.620212.

Integer programming framework for pangenome-based genome inference

Ghanshyam Chandra¹, Md Helal Hossen², Stephan Scholz³,⁴, Alexander T Dilthey³,⁴, Daniel Gibney², Chirag Jain¹

Affiliations

¹ Department of Computational and Data Sciences, Indian Institute of Science, Bangalore KA 560012, India.
² Department of Computer Science, The University of Texas at Dallas, TX 75080, USA.
³ Institute of Medical Microbiology and Hospital Hygiene, Heinrich Heine University Düsseldorf, Düsseldorf, Germany.
⁴ Center for Digital Medicine, Heinrich Heine University Düsseldorf, Düsseldorf, Germany.

PMID: 39554168
PMCID: PMC11565907
DOI: 10.1101/2024.10.27.620212

Ghanshyam Chandra et al. bioRxiv. 2024.

Update in

Pangenome-based genome inference using integer programming.
Chandra G, Hossen MH, Scholz S, Dilthey AT, Gibney D, Jain C. Genome Res. 2025 Dec 3;35(12):2661-2670. doi: 10.1101/gr.280567.125. PMID: 40841174

Abstract

Affordable genotyping methods are essential in genomics. Commonly used genotyping methods primarily support single-nucleotide variants and short indels but neglect structural variants. Additionally, read alignments to a reference genome are unreliable in highly polymorphic and repetitive regions, which further degrades genotyping performance. Recent work highlights the advantage of haplotype-resolved pangenome graphs in addressing these challenges. Building on these developments, we propose a rigorous alignment-free genotyping framework. Our formulation seeks a path through the pangenome graph that maximizes the matches between the path and substrings of sequencing reads (e.g., k-mers) while minimizing recombination events (haplotype switches) along the path. We prove that this problem is NP-hard and develop efficient integer-programming solutions. We benchmarked the algorithm using downsampled short-read datasets from homozygous human cell lines with coverage ranging from 0.1× to 10×. Our algorithm accurately estimates complete major histocompatibility complex (MHC) haplotype sequences with small edit distances from the ground-truth sequences, providing a significant advantage over existing methods on low-coverage inputs. Although our algorithm is designed for haploid samples, we discuss future extensions to diploid samples.
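To make the objective in the abstract concrete, here is a minimal brute-force sketch on a toy graph. It is not the integer-programming formulation of the paper: it enumerates every source-to-sink path and scores each one. The scoring rule (distinct matched k-mers minus `switch_penalty` per haplotype switch), the function names, and the toy data are all assumptions made for illustration.

```python
def kmers(seq, k):
    """Return the set of distinct k-mers in a string."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def edge_support(haplotypes):
    """Map each edge (u, v) to the set of haplotype names that traverse it."""
    support = {}
    for name, path in haplotypes.items():
        for u, v in zip(path, path[1:]):
            support.setdefault((u, v), set()).add(name)
    return support


def min_switches(path, support):
    """Fewest haplotype switches needed to thread `path`.

    Greedy: keep the set of haplotypes that can still carry the path;
    when it becomes empty, count one switch and restart from the current edge.
    """
    switches = 0
    current = None
    for edge in zip(path, path[1:]):
        current = support[edge] if current is None else current & support[edge]
        if not current:
            switches += 1
            current = set(support[edge])
    return switches


def all_paths(source, sink, support):
    """Enumerate every source-to-sink path by DFS (assumes an acyclic graph)."""
    succ = {}
    for u, v in support:
        succ.setdefault(u, []).append(v)
    stack = [[source]]
    while stack:
        path = stack.pop()
        if path[-1] == sink:
            yield path
            continue
        for v in succ.get(path[-1], []):
            stack.append(path + [v])


def best_path(labels, haplotypes, reads, k, switch_penalty, source, sink):
    """Return the path maximizing (matched k-mers) - switch_penalty * (switches)."""
    support = edge_support(haplotypes)
    read_kmers = set().union(*(kmers(r, k) for r in reads))

    def score(path):
        spelled = "".join(labels[v] for v in path)
        matched = len(kmers(spelled, k) & read_kmers)
        return matched - switch_penalty * min_switches(path, support)

    return max(all_paths(source, sink, support), key=score)


# Hypothetical toy pangenome: two haplotypes that differ at two bubbles.
labels = {"s": "AC", "a": "G", "b": "T", "m": "CA", "c": "G", "d": "T", "t": "AC"}
haplotypes = {"h1": ["s", "a", "m", "c", "t"], "h2": ["s", "b", "m", "d", "t"]}
reads = ["ACGCA", "CATAC"]  # sampled from a recombinant of h1 and h2

print(best_path(labels, haplotypes, reads, k=3, switch_penalty=1, source="s", sink="t"))
```

With a small penalty the recombinant path wins because it explains all read k-mers at the cost of one switch; raising the penalty pushes the result back onto a single stored haplotype. Exhaustive enumeration is exponential in general, which is consistent with the NP-hardness result and the reason the paper turns to integer programming.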

Figures

**Fig. 1:**
A small example of our reduction from the Hamiltonian Path Problem to Problem 1 (Theorem 1). (Top) The starting instance $G$ of the Hamiltonian Path Problem. (Bottom) The vertex-labeled graph $G'$ constructed from $G$. Here, $n = 4$ and we assume $c = 2$, making $b = \lceil \log_{2}(n + 2(c(n + 1) + 1)) \rceil + 1 = 6$. Each edge is supported by a unique haplotype (not shown). The string set is $\mathcal{S} = \{0000010000001, 0000100000001, \dots, 0110100000001\}$.

**Fig. 2:**
Evaluation of the performance of the IQP method with and without relaxation of the binary edge variables $x_{u v}$ . We compared runtime using various short-read datasets.

**Fig. 3:**
Evaluation of the performance of the ILP method with and without relaxation of the binary edge variables $x_{u v}$ . We compared runtime using various short-read datasets.

**Fig. 4:**
Performance comparison between the ILP and IQP solutions implemented in PHI. We compared their runtime and memory usage using short-read sequencing datasets sampled from five haplotypes.

**Fig. 5:**
Assessment of PHI’s performance as the number of genomes in the pangenome graph increases. The left panel shows accuracy in terms of edit distance between the output sequences and the ground-truth sequences. The middle and right panels show the runtime and memory usage, respectively.
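The value $b = 6$ quoted in the Fig. 1 caption can be checked numerically. The short sketch below reads the bracket around the logarithm as a ceiling, which is the reading that reproduces the stated value for $n = 4$ and $c = 2$; the function name `label_width` is hypothetical.

```python
import math


def label_width(n: int, c: int) -> int:
    """Evaluate b = ceil(log2(n + 2 * (c * (n + 1) + 1))) + 1 from the Fig. 1 caption."""
    return math.ceil(math.log2(n + 2 * (c * (n + 1) + 1))) + 1


# n = 4, c = 2: the argument of the logarithm is 4 + 2 * (2 * 5 + 1) = 26.
print(label_width(4, 2))
```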
