Haplotype-aware sequence alignment to pangenome graphs

Ghanshyam Chandra¹, Daniel Gibney², Chirag Jain³

Affiliations

¹ Department of Computational and Data Sciences, Indian Institute of Science, Bangalore Karnataka 560012, India.
² Department of Computer Science, The University of Texas at Dallas, Richardson, Texas 75080, USA.
³ Department of Computational and Data Sciences, Indian Institute of Science, Bangalore Karnataka 560012, India; chirag@iisc.ac.in.

PMID: 39013594
PMCID: PMC11529843
DOI: 10.1101/gr.279143.124

Haplotype-aware sequence alignment to pangenome graphs

Ghanshyam Chandra et al. Genome Res. 2024.

. 2024 Oct 11;34(9):1265-1275.

doi: 10.1101/gr.279143.124.

Authors

Ghanshyam Chandra¹, Daniel Gibney², Chirag Jain³

Affiliations

¹ Department of Computational and Data Sciences, Indian Institute of Science, Bangalore Karnataka 560012, India.
² Department of Computer Science, The University of Texas at Dallas, Richardson, Texas 75080, USA.
³ Department of Computational and Data Sciences, Indian Institute of Science, Bangalore Karnataka 560012, India; chirag@iisc.ac.in.

PMID: 39013594
PMCID: PMC11529843
DOI: 10.1101/gr.279143.124

Abstract

Modern pangenome graphs are built using haplotype-resolved genome assemblies. When mapping reads to a pangenome graph, prioritizing alignments that are consistent with the known haplotypes improves genotyping accuracy. However, the existing rigorous formulations for colinear chaining and alignment problems do not consider the haplotype paths in a pangenome graph. This often leads to spurious read alignments to those paths that are unlikely recombinations of the known haplotypes. In this paper, we develop novel formulations and algorithms for sequence-to-graph alignment and chaining problems. Inspired by the genotype imputation models, we assume that a query sequence is an imperfect mosaic of reference haplotypes. Accordingly, we introduce a recombination penalty in the scoring functions for each haplotype switch. First, we solve haplotype-aware sequence-to-graph alignment in [Formula: see text] time, where Q is the query sequence, E is the set of edges, and H is the set of haplotypes represented in the graph. To complement our solution, we prove that an algorithm significantly faster than [Formula: see text] is impossible under the strong exponential time hypothesis (SETH). Second, we propose a haplotype-aware chaining algorithm that runs in [Formula: see text] time after graph preprocessing, where N is the count of input anchors. We then establish that a chaining algorithm significantly faster than [Formula: see text] is impossible under SETH. As a proof-of-concept, we implemented our chaining algorithm in the Minichain aligner. By aligning sequences sampled from the human major histocompatibility complex (MHC) to a pangenome graph of 60 MHC haplotypes, we demonstrate that our algorithm achieves better consistency with ground-truth recombinations compared with a haplotype-agnostic algorithm.

PubMed Disclaimer

Figures

**Figure 1.**
An example of an acyclic pangenome graph built using three haplotypes. The *left* figure shows the optimal alignment of the query sequence to the DAG with minimum edit distance. Accordingly, the edit distance is zero, and the count of recombinations is two. Next, suppose that we use recombination penalty −5 (formally defined in Methods). The *right* figure shows the new optimal alignment, where there is no recombination because the alignment path is consistent with haplotype 2.

**Figure 2.**
An illustration of the pangenome DAG used in the reduction proof of Lemma 3. An initial version of the DAG (A) and the changes made after addition of all the haplotype paths (B).

**Figure 3.**
An example that shows anchors (in red) between a query sequence and a pangenome DAG. The DAG comprises three haplotype sequences, H₁, H₂, H₃. In this example, the ordered sequence of pairs [(2, 1), (4, 1), (5, 1), (9, 1), (10, 1), (13, 3), (14, 3)] forms a chain, and the corresponding anchors are highlighted using blue markers. This chain includes a single recombination.

**Figure 4.**
Reduction from OV problem with sets A = {0110, 1100, 1101} and B = {1010, 1001, 1011}.

**Figure 5.**
An example to illustrate simulation of a query sequence as a mosaic of reference haplotypes. In this example, l = 19.

**Figure 6.**
Pearson's correlation between the number of recombinations in Minichain's output chain and the true count. We evaluated the performance with substitution rates 0.1% (A), 1% (B), and 5% (C) and different recombination penalties.

**Figure 7.**
Box plots show the levels of consistency between the haplotype recombination pairs in Minichain's output chain and the ground truth using three different sets of simulated MHC sequences with substitution rates 0.1% (A), 1% (B), and 5% (C). We tested using different recombination penalties. Each red dot in the plots corresponds to a query sequence. The median values are highlighted in green.

See this image and copyright information in PMC

References

1. Abouelhoda M, Ohlebusch E. 2005. Chaining algorithms for multiple genome comparison. J Discrete Algorithms 3: 321–341. 10.1016/j.jda.2004.08.011 - DOI
1. Amir A, Lewenstein M, Lewenstein N. 2000. Pattern matching in hypertext. J Algorithms 35: 82–99. 10.1006/jagm.1999.1063 - DOI
1. Antipov D, Korobeynikov A, McLean JS, Pevzner PA. 2016. hybridSPAdes: an algorithm for hybrid assembly of short and long reads. Bioinformatics 32: 1009–1015. 10.1093/bioinformatics/btv688 - DOI - PMC - PubMed
1. Avila Cartes J, Bonizzoni P, Ciccolella S, Della Vedova G, Denti L, Didelot X, Monti DC, Pirola Y. 2024. RecGraph: recombination-aware alignment of sequences to variation graphs. Bioinformatics 40: btae292. 10.1093/bioinformatics/btae292 - DOI - PMC - PubMed
1. Backurs A, Indyk P. 2015. Edit distance cannot be computed in strongly subquadratic time (unless SETH is false). In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, Portland, OR, pp. 51–58. Association for Computing Machinery.

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions
Actions

Grants and funding

LinkOut - more resources

Full Text Sources
- PubMed Central
Research Materials
- NCI CPTC Antibody Characterization Program

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Haplotype-aware sequence alignment to pangenome graphs

Affiliations

Haplotype-aware sequence alignment to pangenome graphs

Authors

Affiliations

Abstract

Figures

References

MeSH terms

Grants and funding

LinkOut - more resources

Full Text Sources

Research Materials