. 2022 Mar;21(1):81-108.

doi: 10.1007/s11047-022-09882-6. Epub 2022 Mar 4.

Computational graph pangenomics: a tutorial on data structures and their applications

Jasmijn A Baaijens^{1

2}, Paola Bonizzoni³, Christina Boucher⁴, Gianluca Della Vedova³, Yuri Pirola³, Raffaella Rizzi³, Jouni Sirén⁵

Affiliations

¹ Department of Intelligent Systems, Delft University of Technology, Van Mourik Broekmanweg 6, 2628XE Delft, The Netherlands.
² Department of Biomedical Informatics, Harvard University, 10 Shattuck St, Boston, MA 02115, USA.
³ Department of Informatics, Systems and Communication (DISCo), University of Milano-Bicocca, V.le Sarca, 336, 20126 Milan, Italy.
⁴ Department of Computer and Information Science and Engineering, University of Florida, 432 Newell Dr, Gainesville, FL 32603, USA.
⁵ Genomics Institute, University of California, 1156 High St., Santa Cruz, CA 95064, USA.

PMID: 36969737
PMCID: PMC10038355
DOI: 10.1007/s11047-022-09882-6

Computational graph pangenomics: a tutorial on data structures and their applications

Jasmijn A Baaijens et al. Nat Comput. 2022 Mar.

. 2022 Mar;21(1):81-108.

doi: 10.1007/s11047-022-09882-6. Epub 2022 Mar 4.

Authors

Jasmijn A Baaijens^{1

2}, Paola Bonizzoni³, Christina Boucher⁴, Gianluca Della Vedova³, Yuri Pirola³, Raffaella Rizzi³, Jouni Sirén⁵

Affiliations

¹ Department of Intelligent Systems, Delft University of Technology, Van Mourik Broekmanweg 6, 2628XE Delft, The Netherlands.
² Department of Biomedical Informatics, Harvard University, 10 Shattuck St, Boston, MA 02115, USA.
³ Department of Informatics, Systems and Communication (DISCo), University of Milano-Bicocca, V.le Sarca, 336, 20126 Milan, Italy.
⁴ Department of Computer and Information Science and Engineering, University of Florida, 432 Newell Dr, Gainesville, FL 32603, USA.
⁵ Genomics Institute, University of California, 1156 High St., Santa Cruz, CA 95064, USA.

PMID: 36969737
PMCID: PMC10038355
DOI: 10.1007/s11047-022-09882-6

Abstract

Computational pangenomics is an emerging research field that is changing the way computer scientists are facing challenges in biological sequence analysis. In past decades, contributions from combinatorics, stringology, graph theory and data structures were essential in the development of a plethora of software tools for the analysis of the human genome. These tools allowed computational biologists to approach ambitious projects at population scale, such as the 1000 Genomes Project. A major contribution of the 1000 Genomes Project is the characterization of a broad spectrum of genetic variations in the human genome, including the discovery of novel variations in the South Asian, African and European populations-thus enhancing the catalogue of variability within the reference genome. Currently, the need to take into account the high variability in population genomes as well as the specificity of an individual genome in a personalized approach to medicine is rapidly pushing the abandonment of the traditional paradigm of using a single reference genome. A graph-based representation of multiple genomes, or a graph pangenome, is replacing the linear reference genome. This means completely rethinking well-established procedures to analyze, store, and access information from genome representations. Properly addressing these challenges is crucial to face the computational tasks of ambitious healthcare projects aiming to characterize human diversity by sequencing 1M individuals (Stark et al. 2019). This tutorial aims to introduce readers to the most recent advances in the theory of data structures for the representation of graph pangenomes. We discuss efficient representations of haplotypes and the variability of genotypes in graph pangenomes, and highlight applications in solving computational problems in human and microbial (viral) pangenomes.

PubMed Disclaimer

Figures

**Fig. 1**
A toy example of how a pangenome graph improves the quality of mapping reads to a reference genome. a A multiple sequence alignment of a linear reference genome and other three genomes that contain variations w.r.t. the reference. b A variation graph built from the matrix of the multiple alignment of the genomes; in red the edges that represent variations in the graph and form the typical “bubbles” in the graph. Observe that the graph may contain a path that does not represent any input genome (for example, ACCGTTAAGGGCGATCGAACTCGTTTT). c Mapping of two reads (ACCGTTAAGCGA and ACCGTTAAGCGA) to the linear reference genome. Observe that the alignments induces mismatches and indels. d Mapping of the same reads to the variation graph. Observe that, in this case, the mapping is possible without any mismatch

**Fig. 2**
Example of a variation graph with two dummy vertices: a source and a sink

**Fig. 3**
Example of an alignment (left) of four genomes and a corresponding variation graph (right). The set I of disjoint intervals is in the lower left part of the figures, and each interval is connected with the corresponding set of columns of the alignment. The variation graph has two dummy vertices: a source and a sink, so that each genome corresponds to source-sink walk in the graph. The alignment of the third genome has a block consisting of only a gap; hence, it does not correspond to any vertex of the graph. The red and the green paths identify a variant, also called bubble, in the graph, since they have the same source and sink, while all other vertices are disjoint

**Fig. 4**
Example of a variation graph constructed from four sequences, each represented by a different colored symbol. We color only vertices to simplify the figure

**Fig. 5**
A toy example of how a pattern matches on a variation graph. The pattern is the string TGCAT and the variation graph is the one of Fig. 4. The walk with red vertices and arcs contains the match, but the actual match consists of the underlined portions of the vertex labels. More precisely, the match takes a suffix of the first vertex and a prefix of the last vertex

**Fig. 6**
Example of a panel X of haplotypes with the original order (left) and with the order induced by a₁₄ (right). The arrow highlights that x₁ is the 6th haplotype in the order induced by the lexicographic order of the 14-long reverse prefixes (hence, it is denoted with $y_{6}^{14}$ ). On the right, we reported also the divergence array d₁₄ and we underlined the left-maximal matches ending at position 14 between each $x_{a_{14} [i - 1]}$ and $x_{a_{14} [i]}$ . Position 15 is highlighted and the permutation of the symbols (alleles) at that position induced by a₁₄ is denoted by y¹⁵. That permutation of symbols will be used to compute a₁₅

**Fig. 7**
Computing array a₁₅ from a₁₄. All the elements of a₁₄ whose corresponding character in y₁₅ (*i.e*.,, $x_{a_{k} [\cdot]} [k]$ ) is 0 are placed in a₁₅ before the elements of a₁₄ whose corresponding character in y₁₅ is 1

**Fig. 8**
Computing the arrays a₁₅ and d₁₅. On the left there are the arrays a₁₄ and d₁₄ and the set X sorted by the revpref at position 14. On the right there are the set X sorted by the revpref at position 15 and the arrays a₁₅ and d₁₅. Notice that the set X is not sorted explicitly by the algorithm, and is reported here to make easier to understand the algorithm. The interval that is analyzed to compute the value of the divergence array at position 15 associated with x₂ is represented with a square bracket

**Fig. 9**
A panel of ten tri-allelic haplotypes in their ordering at 20. Haplotype $y_{2}^{20}$ (which is haplotype x₇ in the original panel X) has a candidate set-maximal match from position 16 to position 20 with haplotypes $y_{1}^{20} (x_{5})$ and $y_{3}^{20} (x_{1})$ since d₂₀[2] = d₂₀[3] while d₂₀[1] and d₂₀[4] are both greater that d₂₀[2]. However, since $y_{1}^{20} [21]$ and $y_{3}^{20} [21]$ are both equal to $y_{2}^{20} [21]$ , then the match is not right-maximal and, hence, is not set-maximal. It will be found while scanning column 21 or later. Similarly, $y_{6}^{20}$ has a candidate set-maximal match from 17 to 20 with $y_{7}^{20}$ and $y_{8}^{20}$ . It is an actual set-maximal match because $y_{6}^{20} [21]$ is different from both $y_{7}^{20} [21]$ and $y_{8}^{20} [21]$ . Observe that $y_{7}^{20}$ has not a set-maximal match ending at position 20 because the candidate match from 17 to 20 is with $y_{6}^{20}$ and $y_{8}^{20}$ but $y_{7}^{20} [21] = y_{8}^{20} [21]$ (hence, it will be found while scanning column 21 or later)

**Fig. 10**
Partitioning the BWT into substrings **BWT**_v corresponding to vertices v ∈ V and the representation of BWT offsets i as pairs (v, i′)

**Fig. 11**
The record for vertex v₃ with outgoing paths to v₄, v₅, and v₆. The top part of the record is the vertex identifier. The middle part stores a pair (w, **BWT**.rank(C[v], w)) for each outgoing edge (v, w). The bottom part is **BWT**_v encoded using edge ranks. Observe that there are two paths visiting vertex v₄ from vertices smaller than v₃. Hence, record for vertex v₃ stores the pair (v₄, 2)

**Fig. 12**
The GBWT in Example 2. As in Fig. 11, the top part of each record is the vertex identifier v. The middle part stores a pair (w, **BWT**.rank(C[v], w)) for each outgoing edge (v, w). The bottom part is BWT_v encoded using edge ranks

**Fig. 14**
Dictionary and parse of the set GATTACAT, GATACAT, and GATTAGATA of genomes for w = 2

**Fig. 15**
An illustration of the thresholds and matching statistics for identifying pattern R (left) in the string S (right). We give the longest prefix of the suffix of R that occurs in S, its length (len), and its position S (pos). We give the SA, LCP, the thresholds (THR) and BWT for S. The longest common prefix between each consecutive rotations of S is highlighted in red

**Fig. 16**
A toy example to illustrate the process of viral haplotype assembly. In this example, the task is to obtain the genome variation graph (a viral pangenome) by reconstructing the viral haplotypes from sequencing data, with haplotypes present at different abundances (here 30 vs. 70%). Stars below the original sequences indicate the three positions where the two haplotypes differ. The three data structures involved in the assembly process are (1) an overlap graph, where vertices represent sequencing reads and arcs indicate suffix-prefix overlaps; (2) a de Bruijn graph, where vertexs represent k-mers and arcs indicate overlaps of length k − 1; (3) a variation graph, first constructed from the extended sequences (contigs) obtained through genome assembly, which can be transformed into a genome variation graph that represents the full-length haplotypes. *Note that this example is a simplistic representation of reality: sequencing errors are not shown, hence all overlaps between reads are exact*

See this image and copyright information in PMC

References

1. Abouelhoda M, Kurtz S, Ohlebusch E (2004) Replacing suffix trees with enhanced suffix arrays. J Discret Algorithms 2(1):53–86. 10.1016/S1570-8667(03)00065-0 - DOI
1. Baaijens JA, Zine El Aabidine A, Rivals E et al. (2017) De novo assembly of viral quasispecies using overlap graphs. Genome Res 27(5):835–848. 10.1101/gr.215038.116 - DOI - PMC - PubMed
1. Baaijens JA, Van der Roest B, Köster J et al. (2019) Full-length de novo viral quasispecies assembly through variation graph construction. Bioinformatics 35(24):5086–5094. 10.1093/bioinformatics/btz443 - DOI - PubMed
1. Baaijens JA, Stougie L, Schönhuth A (2020) Strain-aware assembly of genomes from mixed samples using flow variation graphs. bioRxiv:645721. 10.1101/645721 - DOI
1. Ballouz S, Dobin A, Gillis JA (2019) Is it time to change the reference genome? Genome Biol. 10.1186/s13059-019-1774-4 - DOI - PMC - PubMed

Grants and funding

LinkOut - more resources

Full Text Sources
Miscellaneous
- NCI CPTAC Assay Portal

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Computational graph pangenomics: a tutorial on data structures and their applications

Affiliations

Computational graph pangenomics: a tutorial on data structures and their applications

Authors

Affiliations

Abstract

Figures

References

Grants and funding

LinkOut - more resources

Full Text Sources

Miscellaneous