Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2024 May 2;40(5):btae292.
doi: 10.1093/bioinformatics/btae292.

RecGraph: recombination-aware alignment of sequences to variation graphs

Affiliations

RecGraph: recombination-aware alignment of sequences to variation graphs

Jorge Avila Cartes et al. Bioinformatics. .

Abstract

Motivation: Bacterial genomes present more variability than human genomes, which requires important adjustments in computational tools that are developed for human data. In particular, bacteria exhibit a mosaic structure due to homologous recombinations, but this fact is not sufficiently captured by standard read mappers that align against linear reference genomes. The recent introduction of pangenomics provides some insights in that context, as a pangenome graph can represent the variability within a species. However, the concept of sequence-to-graph alignment that captures the presence of recombinations has not been previously investigated.

Results: In this paper, we present the extension of the notion of sequence-to-graph alignment to a variation graph that incorporates a recombination, so that the latter are explicitly represented and evaluated in an alignment. Moreover, we present a dynamic programming approach for the special case where there is at most a recombination-we implement this case as RecGraph. From a modelling point of view, a recombination corresponds to identifying a new path of the variation graph, where the new arc is composed of two halves, each extracted from an original path, possibly joined by a new arc. Our experiments show that RecGraph accurately aligns simulated recombinant bacterial sequences that have at most a recombination, providing evidence for the presence of recombination events.

Availability and implementation: Our implementation is open source and available at https://github.com/AlgoLab/RecGraph.

PubMed Disclaimer

Conflict of interest statement

None declared.

Figures

Figure 1.
Figure 1.
Example of alignment with a recombination of a sequence against a variation graph with three paths: p1 is red, p2 is green, and p3 is blue. The recombination is represented by the thick black arc connecting two diamond-shaped vertices ρ and ψ with grey background. The branching vertex α and the consolidating vertex β are represented by stars. The nodes of the path w have thick edges. Pairs with a grey background correspond to mismatches.
Figure 2.
Figure 2.
Results on alignments accuracy on simulated data. (a) Boxplot of Jaccard similarity coefficient between the real recombinant path and the path computed by RecGraph in no-recombination mode (RG no-rec.), GraphAligner (GA), and RecGraph in recombination mode (RG recomb.). (b) Boxplot of edit distance between the input recombinant sequence and the sequence spelled by the alignment path. (c) Boxplot of minimum number of recombinations explaining the alignment reported by the three tools. Boxplots are grouped by the mutation rate added to the input sequences (colors, from 0% to 10%).
Figure 3.
Figure 3.
Trees computed before and after each recombination breakpoint found by RecGraph. The real sequences are aligned using clustalw, and phylogenetic tree were constructed before and after the breakpoint position using the UPGMA algorithm. Real sequences are clustered according to recombination paths and positions. In each pair of trees the cluster of sequences is indicated by the green diamond, and the recombinant sequences are highlighted in red.
Figure 4.
Figure 4.
Number of recombinations found by JALI on the real dataset. (Left) JALI run with default parameters of gap extension (e) and recombination cost (j). (Right) JALI run with the optimized parameters chosen from the simulated experiment to obtain results most similar to RecGraph.

References

    1. Amir A, Lewenstein M, Lewenstein N. Pattern matching in hypertext. J Algorithms 2000;35:82–99.
    1. Baaijens JA, Bonizzoni P, Boucher C et al. Computational graph pangenomics: a tutorial on data structures and their applications. Nat Comput 2022;21:81–108. - PMC - PubMed
    1. Bonnet K, Marschall T, Doerr D. Constructing founder sets under allelic and non-allelic homologous recombination. Algorithms Mol Biol 2023;18:15. - PMC - PubMed
    1. Colquhoun RM, Hall MB, Lima L et al. Pandora: nucleotide-resolution bacterial pan-genomics with reference graphs. Genome Biol 2021;22:267. - PMC - PubMed
    1. Computational Pan-Genomics Consortium. Computational pan-genomics: status, promises and challenges. Brief Bioinf 2018;19:118–35. - PMC - PubMed

Publication types