Front Bioinform. 2024 Sep 26:4:1397036.

doi: 10.3389/fbinf.2024.1397036. eCollection 2024.

Pangenome comparison via ED strings

Esteban Gabory¹, Moses Njagi Mwaniki², Nadia Pisanti², Solon P Pissis¹ ³, Jakub Radoszewski⁴, Michelle Sweering¹, Wiktor Zuba¹

Affiliations

¹ Centrum Wiskunde & Informatica, Amsterdam, Netherlands.
² Department of Computer Science, University of Pisa, Pisa, Italy.
³ Department of Computer Science, Vrije Universiteit, Amsterdam, Netherlands.
⁴ Institute of Informatics, University of Warsaw, Warsaw, Poland.

PMID: 39391331
PMCID: PMC11464492
DOI: 10.3389/fbinf.2024.1397036

Esteban Gabory et al. Front Bioinform. 2024.

Abstract

Introduction: An elastic-degenerate (ED) string is a sequence of sets of strings. It can also be seen as a directed acyclic graph whose edges are labeled by strings. The notion of ED strings was introduced as a simple alternative to variation and sequence graphs for representing a pangenome, that is, a collection of genomic sequences to be analyzed jointly or to be used as a reference.
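As a minimal sketch of the definition above, an ED string can be modelled as a list of sets of strings, and the language it represents as every concatenation that picks one string per set, in order. The example string `T` and the helper name `language` are hypothetical illustrations, not taken from the paper or its software.

```python
from itertools import product

# Hypothetical ED string: a list of sets of strings.
# The empty Python string "" plays the role of the empty string ε.
T = [{"A", "T"}, {"C", "CG", ""}, {"G"}]

def language(ed):
    """Return every string obtained by choosing one string from each set, in order."""
    return {"".join(choice) for choice in product(*ed)}

print(sorted(language(T)))
```

Enumerating the language is exponential in the number of sets in general, so this is only meant for toy inputs.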

Methods: In this study, we define notions of matching statistics of two ED strings as similarity measures between pangenomes and, consequently, infer a corresponding distance measure. We then show that both measures can be computed efficiently, in both theory and practice, by employing the intersection graph of two ED strings.

Results: We also implemented our methods as a software tool for pangenome comparison and evaluated their efficiency and effectiveness using both synthetic and real datasets.

Discussion: As for efficiency, we compare the runtime of the intersection graph method against the classic product automaton construction, showing that the intersection graph is faster by up to one order of magnitude. To show effectiveness, we used real SARS-CoV-2 datasets and our matching statistics similarity measure to reproduce a well-established clade classification of SARS-CoV-2, thus demonstrating that the classification obtained by our method is in accordance with the existing one.

Keywords: SARS-CoV-2; elastic-degenerate string; intersection graph; matching statistics; pangenome comparison.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

**FIGURE 1**
An example of an MSA (top left) and its corresponding (non-unique) ED string $T$ of length $n = 7$, cardinality $m = 11$ and size $N = 20$ (top right), and edge-labeled DAG for $T$. Note that $ε$ denotes the empty string. The DAG can also be viewed as an NFA with extended (multi-letter) transitions.

**FIGURE 2**
The two DAGs $G_{1}$ and $G_{2}$ for ED strings $T_{1}$ and $T_{2}$. The filled black nodes are explicit states, while the orange empty nodes are implicit states.

**FIGURE 3**
Intersection graph $G$ for $T_{1}$ and $T_{2}$, where $G_{1}$ and $G_{2}$ are shown at the left and on the top, respectively, to simplify the understanding of $G$. A node $(i, j)$ in the intersection is represented by a square if both $i$ and $j$ are explicit nodes, and by a circle if only one of them is. The dashed edges of the intersection graph $G$ correspond to $ε$-transitions (namely, transitions such that no letter is read when traversed), while the solid edges correspond to the other extended transitions. A string in $L(T_{1}) \cap L(T_{2})$ corresponds to a path from the starting node of $G$ to the accepting node. Here the intersection is nonempty and contains a single string $AC$, which can be read on the red path.
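For very small inputs, the set $L(T_{1}) \cap L(T_{2})$ can be checked by brute force, which is a useful sanity test for any graph-based implementation. The sketch below enumerates both languages and intersects them; it is not the intersection graph construction described in the caption. The strings `T1` and `T2` are hypothetical and are not claimed to be the running example of the figures.

```python
from itertools import product

def language(ed):
    """All strings spelled by choosing one string per set, in order."""
    return {"".join(choice) for choice in product(*ed)}

def ed_intersection(t1, t2):
    """Brute-force L(t1) ∩ L(t2); exponential, for toy inputs only."""
    return language(t1) & language(t2)

# Hypothetical toy ED strings ("" stands for ε).
T1 = [{"A", "TG"}, {"C", "CA"}]
T2 = [{"", "AT"}, {"AC", "GCA"}]
print(ed_intersection(T1, T2))  # {'AC'}
```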

**FIGURE 4**
Matching Statistics of $T_{1}$ and $T_{2}$ of our running example on their intersection graph $G$, where, again to simplify the understanding, we also draw $G_{1}$ and $G_{2}$ at the left and on the top, respectively. Note that this time, the pairs of implicit nodes that are reachable in a single extended transition from a pair that was previously computed are added. In the figure, there is only one such extra node, which is represented by a green open circle at the right of the graph. Here we highlight the paths that are relevant for computing the Matching Statistics array ${MS}_{T_{1}, T_{2}}$. To compute ${MS}_{T_{1}, T_{2}} [1]$, we look at the paths starting at nodes $(i, j)$ where $i$ is the explicit state 1 in the path-automaton of $T_{1}$, and return the length of the longest label of such a path. These are the paths starting in one of the blue nodes (these are the nodes that correspond to the uppermost explicit node of $G_{1}$ paired with any node of $G_{2}$, that is, they correspond to the uppermost dotted copy of $G_{2}$). The longest of these paths (also drawn in blue) corresponds to the string $TGC$ having length 3; therefore, ${MS}_{T_{1}, T_{2}} [1] = 3$. For ${MS}_{T_{1}, T_{2}} [2]$ we do the same but using as starting nodes those in red that correspond to the internal explicit node of $G_{1}$ paired with any node of $G_{2}$ (i.e., the nodes of the middle dotted copy of $G_{2}$). Here the longest path is drawn in red and it spells the string $CA$, and therefore we set ${MS}_{T_{1}, T_{2}} [2] = 2$. Computing ${MS}_{T_{2}, T_{1}}$ can be performed in a dual manner on the same graph, but using as starting nodes those of the leftmost dotted copy of $G_{1}$ for ${MS}_{T_{2}, T_{1}} [1]$, and those of the middle dotted copy of $G_{1}$ for ${MS}_{T_{2}, T_{1}} [2]$.
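A brute-force reading of this caption can be sketched as follows. It assumes that ${MS}_{T_{1}, T_{2}} [i]$ is the length of the longest string that can be read starting at the $i$-th set of $T_{1}$ and that also occurs, starting anywhere, inside some string of $L(T_{2})$. This interpretation is an assumption drawn from the caption alone; the paper's formal definition may differ in details. The inputs are hypothetical toy strings rather than the running example, and the returned list is 0-indexed, whereas the caption indexes from 1.

```python
from itertools import product

def language(ed):
    """All strings spelled by choosing one string per set, in order."""
    return {"".join(choice) for choice in product(*ed)}

def matching_statistics(t1, t2):
    """Brute-force matching statistics under the assumed definition.

    ms[i] is the longest prefix length of a string in L(t1[i:]) that is a
    substring of some string in L(t2). Exponential; toy inputs only.
    """
    targets = language(t2)
    ms = []
    for i in range(len(t1)):
        best = 0
        for s in language(t1[i:]):
            # Try prefixes from longest to shortest, skipping lengths <= best.
            for k in range(len(s), best, -1):
                if any(s[:k] in w for w in targets):
                    best = k
                    break
        ms.append(best)
    return ms

# Hypothetical toy ED strings ("" stands for ε); not the paper's example.
T1 = [{"A", "TG"}, {"C", "CA"}]
T2 = [{"", "AT"}, {"AC", "GCA"}]
print(matching_statistics(T1, T2))  # [4, 2]
```

Here the first entry comes from `TGCA`, which occurs inside `ATGCA`, and the second from `CA`, which occurs inside `GCA`.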

**FIGURE 5**
Breakpoint Matching Statistics computation in the intersection graph $G$ of $T_{1}$ and $T_{2}$. To compute ${BMS}_{T_{1}, T_{2}} [1]$, the candidate starting nodes of the match in $G$ are those in blue: nodes $(i, j)$ where $i$ is an explicit state of $T_{1}$ in the uppermost dotted copy of $G_{2}$, and $j$ is either an explicit state of $T_{2}$ (squared blue nodes) or an implicit one (circled blue nodes). Note that $TGC$ is the longest match that starts at the first set of $T_{1}$ but it does not fulfill the conditions for the Breakpoint Matching Statistics because it does not end at any breakpoint; for the same reason, $TG$ is also not a good candidate match. The occurrence of $AC$ corresponding to the blue edge starts at a blue square node; hence it is reachable from the node itself that corresponds to a pair of explicit states, and it ends at a node that is again a pair of explicit states, and hence a breakpoint for both $T_{1}$ and $T_{2}$. There is no longer match satisfying these conditions; therefore we set ${BMS}_{T_{1}, T_{2}} [1] = 2$. For ${BMS}_{T_{1}, T_{2}} [2]$ we do the same but use as starting nodes those in red that correspond to the internal explicit node of $G_{1}$ paired with any node of $G_{2}$ (i.e., the nodes of the middle dotted copy of $G_{2}$). The red path spelling $C$: (i) is a prefix in $T_{1} [2]$ starting at an explicit node of $T_{1}$; (ii) is reachable from a square node in $G$ by spelling $A$ in both strings (curved brown red edge labeled with $A$); and (iii) ends where $T_{2} [2]$ does, that is, at a breakpoint. Since this is the longest such path in $G$, we set ${BMS}_{T_{1}, T_{2}} [2] = 1$. Note, for example, that the match $CA$ that occurs in $T_{1} [2]$ and inside $T_{2} [2]$ cannot be used for ${BMS}_{T_{1}, T_{2}} [2]$ because it starts at a node that is not reachable from a pair of explicit nodes, meaning that it is not upper-bounded by a breakpoint in $T_{2}$.
Computing ${BMS}_{T_{2}, T_{1}}$, which is of size $n_{2} = 2$, can be done in a dual manner on the very same graph, using as starting nodes those of the leftmost dotted copy of $G_{1}$ for ${BMS}_{T_{2}, T_{1}} [1] = 2$ (obtained by traversing an $ε$-transition and then $AC$), and those of the middle dotted copy of $G_{1}$ for ${BMS}_{T_{2}, T_{1}} [2] = 2$ ($AC$ again).

**FIGURE 6**
SARS-CoV-2 clades pairwise similarity graph generated according to average Breakpoint Matching Statistics. The annotation (all graphics and text that are neither grey nor black) highlights similarities with Figure 7.

**FIGURE 7**
Phylogeny of 3357 SARS-CoV-2 genome samples. The figure was generated and downloaded from Nextstrain https://nextstrain.org/ncov/open/global/all-time (2024), and some annotation was added here to highlight similarities with the graph of Figure 6.
