PanPA: generation and alignment of panproteome graphs

Fawaz Dabbaghie^{1,2,3}, Sanjay K Srikakulam^{3,4,5}, Tobias Marschall^{1,2}, Olga V Kalinina^{3,6,7}

Affiliations

¹ Institute for Medical Biometry and Bioinformatics, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, 40225 Düsseldorf, Germany.
² Center for Digital Medicine, Heinrich Heine University, 40225 Düsseldorf, Germany.
³ Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Helmholtz Center for Infection Research (HZI), Saarbrücken, Germany.
⁴ Graduate School of Computer Science, Saarland University, 66123 Saarbrücken, Germany.
⁵ Interdisciplinary Graduate School of Natural Product Research, Saarland University, 66123 Saarbrücken, Germany.
⁶ Drug Bioinformatics, Medical Faculty, Saarland University, 66421 Homburg, Germany.
⁷ Center for Bioinformatics, Saarland University, 66123 Saarbrücken, Germany.

PMID: 38145107
PMCID: PMC10748787
DOI: 10.1093/bioadv/vbad167

Fawaz Dabbaghie et al. Bioinform Adv. 2023 Nov 24;3(1):vbad167. doi: 10.1093/bioadv/vbad167. eCollection 2023.

Abstract

Motivation: Compared to eukaryotes, prokaryote genomes are more diverse through different mechanisms, including a higher mutation rate and horizontal gene transfer. Therefore, using a linear representative reference can cause a reference bias. Graph-based pangenome methods have been developed to tackle this problem. However, comparisons in DNA space are still challenging due to this high diversity. In contrast, amino acid sequences have higher similarity due to evolutionary constraints, whereby a single amino acid may be encoded by several synonymous codons. Coding regions cover the majority of the genome in prokaryotes. Thus, panproteomes present an attractive alternative leveraging the higher sequence similarity while not losing much of the genome in non-coding regions.

Results: We present PanPA, a method that takes a set of multiple sequence alignments of protein sequences, indexes them, and builds a graph for each multiple sequence alignment. In the querying step, it can align DNA or amino acid sequences back to these graphs. We first showcase that PanPA generates correct alignments on a panproteome from 1350 Escherichia coli. To demonstrate that panproteomes allow comparisons at longer phylogenetic distances, we compare DNA and protein alignments from 1073 Salmonella enterica assemblies against the E.coli reference genome, pangenome, and panproteome using BWA, GraphAligner, and PanPA, respectively, with PanPA aligning around 22% more sequences. We also aligned a DNA short-read whole-genome sequencing (WGS) sample from S.enterica against the E.coli reference with BWA and against the panproteome with PanPA, where PanPA was able to find alignments for 68% of the reads compared to 5% with BWA.
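The abstract states that DNA queries can be aligned to protein graphs but does not say how. One common approach, shown here only as an assumption and not as PanPA's documented behavior, is to translate the DNA query in all six reading frames and align each translation. The function names `translate` and `six_frames` are hypothetical; the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna):
    """Return three forward-strand and three reverse-strand translations."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [
        translate(strand[offset:])
        for strand in (dna, reverse)
        for offset in range(3)
    ]


frames = six_frames("ATGGAACCG")
print(frames)
```

Each of the six translations could then be used as an amino acid query, keeping whichever frame aligns best.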

Availability and implementation: PanPA is available at https://github.com/fawaz-dabbaghieh/PanPA.

Conflict of interest statement

None declared.

Figures

**Figure 1.**
MSA to GFA: turning an MSA into a graph. The MSA in this example contains three sequences, *-MEPTPEQ*, *---T--MA*, and *MSETQSTQ*; and the step-by-step graph construction is shown on the panels from top to bottom. At every step, the yellow column is the current position and the red column is the previous one.
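The column-by-column construction described in this caption can be sketched as follows. This is a minimal illustration, not PanPA's implementation: it assumes the three example rows are equal-length gapped strings (`-MEPTPEQ`, `---T--MA`, `MSETQSTQ`), creates one node per (column, residue) pair, and omits both the merging of linear chains into longer nodes and the GFA output. The function name `msa_to_graph` is hypothetical.

```python
from collections import defaultdict


def msa_to_graph(msa):
    """Build an uncompacted residue graph from equal-length gapped MSA rows.

    Nodes are (column, residue) pairs. An edge joins two nodes when some
    sequence visits them consecutively, skipping over gap columns.
    """
    if len({len(row) for row in msa}) != 1:
        raise ValueError("all MSA rows must have the same length")
    nodes = set()
    edges = defaultdict(set)
    for row in msa:
        previous = None  # last non-gap node seen in this row
        for column, residue in enumerate(row):
            if residue == "-":
                continue
            node = (column, residue)
            nodes.add(node)
            if previous is not None:
                edges[previous].add(node)
            previous = node
    return nodes, dict(edges)


msa = ["-MEPTPEQ", "---T--MA", "MSETQSTQ"]
nodes, edges = msa_to_graph(msa)
for source in sorted(edges):
    print(source, "->", sorted(edges[source]))
```

Sequences that share a residue in a column share the node, so branching appears only where the rows disagree or where one row has a gap.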

**Figure 2.**
Alignment of a sequence to a protein graph. Top: example protein graph; bottom: the corresponding DP table. The ordered graph vertices are in the columns, and the query sequence is in the rows. Arrows between columns correspond to the graph edges. Arrows in the DP table correspond to potential previous cells in the DP process.
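A DP table of this shape, with query residues in the rows and topologically ordered vertices in the columns, can be sketched as below. This is a simplified illustration under stated assumptions: one residue per vertex, local alignment, a linear gap penalty, and flat match/mismatch scores instead of a substitution matrix. The name `align_to_graph` and all scoring values are hypothetical, and only the best score is returned; a traceback would store the chosen arrow for each cell.

```python
def align_to_graph(query, labels, preds, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of `query` against a DAG.

    labels[j] is the residue on vertex j and preds[j] lists the vertices
    with an edge into j. Vertices must be in topological order, so every
    p in preds[j] satisfies p < j.
    """
    n_rows, n_cols = len(query) + 1, len(labels)
    # score[i][j]: best alignment ending at query[:i] and at vertex j
    score = [[0] * n_cols for _ in range(n_rows)]
    best = 0
    for i in range(1, n_rows):
        for j in range(n_cols):
            substitution = match if query[i - 1] == labels[j] else mismatch
            # Diagonal move: arrive from any predecessor column, or start fresh.
            diagonal = max([0] + [score[i - 1][p] for p in preds[j]])
            candidates = [0, diagonal + substitution, score[i - 1][j] + gap]
            # Skipping vertex j: stay in row i, arrive from a predecessor.
            candidates += [score[i][p] + gap for p in preds[j]]
            score[i][j] = max(candidates)
            best = max(best, score[i][j])
    return best


# Toy graph: M -> (E or S) -> T
labels = ["M", "E", "S", "T"]
preds = [[], [0], [0], [1, 2]]
print(align_to_graph("MST", labels, preds))
```

The only difference from string-to-string alignment is that the "previous column" is every predecessor vertex rather than column j - 1.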

**Figure 3.**
The general PanPA pipeline and its subcommands (in blue). Each subcommand can also be run separately or more than once with different parameters.

**Figure 4.**
Effect of the different parameters on the fraction of wrongly aligned sequences, where a “wrong alignment” is a sequence being aligned to a different graph than the one it originated from. Each point is colored by the seed hits limit (the maximum number of hits each seed can point to), and shapes correspond to the aligned hits limit (the maximum number of graphs one sequence can align to). For small k values, a high number of wrong alignments is produced, unless the index size is limited. The align seed limit has a relatively small effect on the percentage of wrong alignments.
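The two limits in this caption can be illustrated with a toy k-mer index. This is a sketch under assumptions, not PanPA's index: seeds are plain k-mers, a seed that points to more graphs than the seed hits limit is dropped entirely (the source does not say whether such seeds are dropped or truncated), and candidate graphs are ranked by the number of shared seeds. The names `build_index` and `candidate_graphs`, the graph ids, and the sequences are all hypothetical.

```python
from collections import Counter, defaultdict


def build_index(graph_sequences, k, seed_hits_limit=None):
    """Map each k-mer to the set of graph ids whose sequences contain it."""
    index = defaultdict(set)
    for graph_id, sequences in graph_sequences.items():
        for seq in sequences:
            for i in range(len(seq) - k + 1):
                index[seq[i:i + k]].add(graph_id)
    if seed_hits_limit is not None:
        # Assumption: uninformative seeds (too many hits) are discarded.
        return {s: g for s, g in index.items() if len(g) <= seed_hits_limit}
    return dict(index)


def candidate_graphs(query, index, k, aligned_hits_limit):
    """Return up to `aligned_hits_limit` graph ids, most seed hits first."""
    votes = Counter()
    for i in range(len(query) - k + 1):
        for graph_id in index.get(query[i:i + k], ()):
            votes[graph_id] += 1
    return [graph_id for graph_id, _ in votes.most_common(aligned_hits_limit)]


toy = {"graph_a": ["MSDLAREI"], "graph_b": ["MSDMAERL"]}
full_index = build_index(toy, k=3)
limited_index = build_index(toy, k=3, seed_hits_limit=1)
print(candidate_graphs("SDLARE", limited_index, k=3, aligned_hits_limit=1))
```

With small k, many seeds are shared between unrelated graphs, which is consistent with the caption's observation that limiting seed hits reduces wrong alignments for small k values.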

**Figure 5.**
Effect of the different parameters on the number of unaligned sequences when aligning 92 196 unseen *E.coli* sequences. For small k values, the majority of sequences were not aligned unless a limit for the index hits size is set (the red marks); if the index hits size is not limited, over 99% of sequences produce an alignment.

**Figure 6.**
Upset plot of the unique alignments of 4 839 981 sequences from the coding regions of 1074 *S.enterica* assemblies from RefSeq. Alignments with BWA and GraphAligner (DNA), and PanPA (amino acids) against their corresponding *E.coli* counterparts were constructed using the parameters in Supplementary Section S2.

**Figure 7.**
Distribution of identity scores between BWA, GraphAligner, and PanPA from aligning the *S.enterica* sequences. The peak for PanPA is shifted to the right, meaning higher sequence identity, as amino acid sequences align with higher identity compared to nucleotide sequences.

**Figure 8.**
Visualization of parts of the protein graphs for (a) GyrA and (b) ParC using Bandage (Wick *et al.* 2015). Nodes are colored according to the number of resistant/susceptible strains that pass through them, with blue representing resistance and red representing susceptibility; the color intensity corresponds to the number of strains. Additional colored lines show the alignment paths of the 10% of sequences that were set aside (45 resistant and 117 susceptible sequences), with the color representing the type and the thickness representing the number of sequences taking that path. A thick blue line of resistant sequences follows the blue path through the blue nodes, and, *vice versa*, a thick red line of susceptible sequences follows the red path through the red nodes.
