Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2022 Mar;21(1):81-108.
doi: 10.1007/s11047-022-09882-6. Epub 2022 Mar 4.

Computational graph pangenomics: a tutorial on data structures and their applications

Affiliations

Computational graph pangenomics: a tutorial on data structures and their applications

Jasmijn A Baaijens et al. Nat Comput. 2022 Mar.

Abstract

Computational pangenomics is an emerging research field that is changing the way computer scientists are facing challenges in biological sequence analysis. In past decades, contributions from combinatorics, stringology, graph theory and data structures were essential in the development of a plethora of software tools for the analysis of the human genome. These tools allowed computational biologists to approach ambitious projects at population scale, such as the 1000 Genomes Project. A major contribution of the 1000 Genomes Project is the characterization of a broad spectrum of genetic variations in the human genome, including the discovery of novel variations in the South Asian, African and European populations-thus enhancing the catalogue of variability within the reference genome. Currently, the need to take into account the high variability in population genomes as well as the specificity of an individual genome in a personalized approach to medicine is rapidly pushing the abandonment of the traditional paradigm of using a single reference genome. A graph-based representation of multiple genomes, or a graph pangenome, is replacing the linear reference genome. This means completely rethinking well-established procedures to analyze, store, and access information from genome representations. Properly addressing these challenges is crucial to face the computational tasks of ambitious healthcare projects aiming to characterize human diversity by sequencing 1M individuals (Stark et al. 2019). This tutorial aims to introduce readers to the most recent advances in the theory of data structures for the representation of graph pangenomes. We discuss efficient representations of haplotypes and the variability of genotypes in graph pangenomes, and highlight applications in solving computational problems in human and microbial (viral) pangenomes.

PubMed Disclaimer

Figures

Fig. 1
Fig. 1
A toy example of how a pangenome graph improves the quality of mapping reads to a reference genome. a A multiple sequence alignment of a linear reference genome and other three genomes that contain variations w.r.t. the reference. b A variation graph built from the matrix of the multiple alignment of the genomes; in red the edges that represent variations in the graph and form the typical “bubbles” in the graph. Observe that the graph may contain a path that does not represent any input genome (for example, ACCGTTAAGGGCGATCGAACTCGTTTT). c Mapping of two reads (ACCGTTAAGCGA and ACCGTTAAGCGA) to the linear reference genome. Observe that the alignments induces mismatches and indels. d Mapping of the same reads to the variation graph. Observe that, in this case, the mapping is possible without any mismatch
Fig. 2
Fig. 2
Example of a variation graph with two dummy vertices: a source and a sink
Fig. 3
Fig. 3
Example of an alignment (left) of four genomes and a corresponding variation graph (right). The set I of disjoint intervals is in the lower left part of the figures, and each interval is connected with the corresponding set of columns of the alignment. The variation graph has two dummy vertices: a source and a sink, so that each genome corresponds to source-sink walk in the graph. The alignment of the third genome has a block consisting of only a gap; hence, it does not correspond to any vertex of the graph. The red and the green paths identify a variant, also called bubble, in the graph, since they have the same source and sink, while all other vertices are disjoint
Fig. 4
Fig. 4
Example of a variation graph constructed from four sequences, each represented by a different colored symbol. We color only vertices to simplify the figure
Fig. 5
Fig. 5
A toy example of how a pattern matches on a variation graph. The pattern is the string TGCAT and the variation graph is the one of Fig. 4. The walk with red vertices and arcs contains the match, but the actual match consists of the underlined portions of the vertex labels. More precisely, the match takes a suffix of the first vertex and a prefix of the last vertex
Fig. 6
Fig. 6
Example of a panel X of haplotypes with the original order (left) and with the order induced by a14 (right). The arrow highlights that x1 is the 6th haplotype in the order induced by the lexicographic order of the 14-long reverse prefixes (hence, it is denoted with y614). On the right, we reported also the divergence array d14 and we underlined the left-maximal matches ending at position 14 between each xa14[i1] and xa14[i]. Position 15 is highlighted and the permutation of the symbols (alleles) at that position induced by a14 is denoted by y15. That permutation of symbols will be used to compute a15
Fig. 7
Fig. 7
Computing array a15 from a14. All the elements of a14 whose corresponding character in y15 (i.e.,, xak[][k]) is 0 are placed in a15 before the elements of a14 whose corresponding character in y15 is 1
Fig. 8
Fig. 8
Computing the arrays a15 and d15. On the left there are the arrays a14 and d14 and the set X sorted by the revpref at position 14. On the right there are the set X sorted by the revpref at position 15 and the arrays a15 and d15. Notice that the set X is not sorted explicitly by the algorithm, and is reported here to make easier to understand the algorithm. The interval that is analyzed to compute the value of the divergence array at position 15 associated with x2 is represented with a square bracket
Fig. 9
Fig. 9
A panel of ten tri-allelic haplotypes in their ordering at 20. Haplotype y220 (which is haplotype x7 in the original panel X) has a candidate set-maximal match from position 16 to position 20 with haplotypes y120(x5) and y320(x1) since d20[2] = d20[3] while d20[1] and d20[4] are both greater that d20[2]. However, since y120[21] and y320[21] are both equal to y220[21], then the match is not right-maximal and, hence, is not set-maximal. It will be found while scanning column 21 or later. Similarly, y620 has a candidate set-maximal match from 17 to 20 with y720 and y820. It is an actual set-maximal match because y620[21] is different from both y720[21] and y820[21]. Observe that y720 has not a set-maximal match ending at position 20 because the candidate match from 17 to 20 is with y620 and y820 but y720[21]=y820[21] (hence, it will be found while scanning column 21 or later)
Fig. 10
Fig. 10
Partitioning the BWT into substrings BWTv corresponding to vertices vV and the representation of BWT offsets i as pairs (v, i′)
Fig. 11
Fig. 11
The record for vertex v3 with outgoing paths to v4, v5, and v6. The top part of the record is the vertex identifier. The middle part stores a pair (w, BWT.rank(C[v], w)) for each outgoing edge (v, w). The bottom part is BWTv encoded using edge ranks. Observe that there are two paths visiting vertex v4 from vertices smaller than v3. Hence, record for vertex v3 stores the pair (v4, 2)
Fig. 12
Fig. 12
The GBWT in Example 2. As in Fig. 11, the top part of each record is the vertex identifier v. The middle part stores a pair (w, BWT.rank(C[v], w)) for each outgoing edge (v, w). The bottom part is BWTv encoded using edge ranks
Fig. 13
Fig. 13
The GBWT from Example 4
Fig. 14
Fig. 14
Dictionary and parse of the set GATTACAT, GATACAT, and GATTAGATA of genomes for w = 2
Fig. 15
Fig. 15
An illustration of the thresholds and matching statistics for identifying pattern R (left) in the string S (right). We give the longest prefix of the suffix of R that occurs in S, its length (len), and its position S (pos). We give the SA, LCP, the thresholds (THR) and BWT for S. The longest common prefix between each consecutive rotations of S is highlighted in red
Fig. 16
Fig. 16
A toy example to illustrate the process of viral haplotype assembly. In this example, the task is to obtain the genome variation graph (a viral pangenome) by reconstructing the viral haplotypes from sequencing data, with haplotypes present at different abundances (here 30 vs. 70%). Stars below the original sequences indicate the three positions where the two haplotypes differ. The three data structures involved in the assembly process are (1) an overlap graph, where vertices represent sequencing reads and arcs indicate suffix-prefix overlaps; (2) a de Bruijn graph, where vertexs represent k-mers and arcs indicate overlaps of length k − 1; (3) a variation graph, first constructed from the extended sequences (contigs) obtained through genome assembly, which can be transformed into a genome variation graph that represents the full-length haplotypes. Note that this example is a simplistic representation of reality: sequencing errors are not shown, hence all overlaps between reads are exact

References

    1. Abouelhoda M, Kurtz S, Ohlebusch E (2004) Replacing suffix trees with enhanced suffix arrays. J Discret Algorithms 2(1):53–86. 10.1016/S1570-8667(03)00065-0 - DOI
    1. Baaijens JA, Zine El Aabidine A, Rivals E et al. (2017) De novo assembly of viral quasispecies using overlap graphs. Genome Res 27(5):835–848. 10.1101/gr.215038.116 - DOI - PMC - PubMed
    1. Baaijens JA, Van der Roest B, Köster J et al. (2019) Full-length de novo viral quasispecies assembly through variation graph construction. Bioinformatics 35(24):5086–5094. 10.1093/bioinformatics/btz443 - DOI - PubMed
    1. Baaijens JA, Stougie L, Schönhuth A (2020) Strain-aware assembly of genomes from mixed samples using flow variation graphs. bioRxiv:645721. 10.1101/645721 - DOI
    1. Ballouz S, Dobin A, Gillis JA (2019) Is it time to change the reference genome? Genome Biol. 10.1186/s13059-019-1774-4 - DOI - PMC - PubMed