Bioinform Adv. 2023 Nov 24;3(1):vbad167.
doi: 10.1093/bioadv/vbad167. eCollection 2023.

PanPA: generation and alignment of panproteome graphs

Fawaz Dabbaghie et al. Bioinform Adv.

Abstract

Motivation: Compared to eukaryotes, prokaryote genomes are more diverse through different mechanisms, including a higher mutation rate and horizontal gene transfer. Therefore, using a linear representative reference can cause a reference bias. Graph-based pangenome methods have been developed to tackle this problem. However, comparisons in DNA space are still challenging due to this high diversity. In contrast, amino acid sequences have higher similarity due to evolutionary constraints, whereby a single amino acid may be encoded by several synonymous codons. Coding regions cover the majority of the genome in prokaryotes. Thus, panproteomes present an attractive alternative leveraging the higher sequence similarity while not losing much of the genome in non-coding regions.

Results: We present PanPA, a method that takes a set of multiple sequence alignments of protein sequences, indexes them, and builds a graph for each multiple sequence alignment. In the querying step, it can align DNA or amino acid sequences back to these graphs. We first showcase that PanPA generates correct alignments on a panproteome from 1350 Escherichia coli. To demonstrate that panproteomes allow comparisons at longer phylogenetic distances, we compare DNA and protein alignments from 1073 Salmonella enterica assemblies against the E. coli reference genome, pangenome, and panproteome using BWA, GraphAligner, and PanPA, respectively, with PanPA aligning around 22% more sequences. We also aligned a DNA short-read whole genome sequencing (WGS) sample from S. enterica against the E. coli reference with BWA and against the panproteome with PanPA, where PanPA was able to find alignments for 68% of the reads compared to 5% with BWA.

Availability and implementation: PanPA is available at https://github.com/fawaz-dabbaghieh/PanPA.

Conflict of interest statement

None declared.

Figures

Figure 1.
MSA to GFA: turning an MSA into a graph. The MSA in this example contains three sequences, -MEPTPEQ, ---T--MA, and MSETQSTQ; the step-by-step graph construction is shown on the panels from top to bottom. At every step, the yellow column is the current position and the red column is the previous one.
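The column-by-column construction in Figure 1 can be sketched as follows. This is a minimal illustration, not PanPA's implementation: it assumes the three rows are -MEPTPEQ, ---T--MA, and MSETQSTQ (8 columns each), creates one node per (column, residue) pair, and links each residue to the previous non-gap residue of the same row. The function name `msa_to_graph` is hypothetical, and any compaction of linear node chains or GFA output is omitted.

```python
from collections import defaultdict


def msa_to_graph(msa):
    """Build an uncompacted column graph from equal-length aligned rows.

    Nodes are (column, residue) pairs; an edge links two residues that are
    consecutive (ignoring gaps) in at least one row. Hypothetical helper,
    not part of PanPA.
    """
    if len({len(row) for row in msa}) != 1:
        raise ValueError("all MSA rows must have the same length")
    nodes = set()
    edges = defaultdict(set)
    for row in msa:
        prev = None  # last non-gap node seen in this row
        for col, residue in enumerate(row):
            if residue == "-":
                continue  # gaps create no node; the next residue links past them
            node = (col, residue)
            nodes.add(node)
            if prev is not None:
                edges[prev].add(node)
            prev = node
    return nodes, edges


msa = ["-MEPTPEQ", "---T--MA", "MSETQSTQ"]
nodes, edges = msa_to_graph(msa)
print(len(nodes), sum(len(targets) for targets in edges.values()))
```

Rows that share a residue in the same column share a node, so the E in column 2 branches to both P and T in column 3, and the gap run in the second row becomes a single edge from T (column 3) to M (column 6).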
Figure 2.
Alignment of a sequence to a protein graph. Top: example protein graph; bottom: the corresponding DP table. The ordered graph vertices are in the columns, and the query sequence is in the rows. Arrows between columns correspond to the graph edges. Arrows in the DP table correspond to potential previous cells in the DP process.
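The DP layout described in Figure 2 (query in the rows, topologically ordered vertices in the columns, each cell looking back along graph edges) can be sketched as below. The assumptions are loud ones: one residue per vertex, a flat match/mismatch/gap score instead of a substitution matrix, and free start and end positions in the graph. The function `align_to_dag` and the toy graph are hypothetical and are not taken from PanPA.

```python
def align_to_dag(query, labels, preds, match=1, mismatch=-1, gap=-1):
    """Score the best alignment of `query` to a path in a DAG.

    labels[v] is the residue of vertex v; preds[v] lists its predecessors.
    Vertices must be topologically ordered (every predecessor index < v).
    Returns (best_score, end_vertex); end_vertex is None if nothing aligned.
    """
    m, n = len(query), len(labels)
    for v in range(n):
        if any(p >= v for p in preds[v]):
            raise ValueError("vertices must be in topological order")
    # Column 0 is a virtual source; column v + 1 holds graph vertex v.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i * gap  # query residues consumed before entering the graph
    for v in range(n):
        # The source is always a candidate, so any vertex may start the path.
        prev_cols = [0] + [p + 1 for p in preds[v]]
        dp[0][v + 1] = max(dp[0][u] for u in prev_cols) + gap
        for i in range(1, m + 1):
            sub = match if query[i - 1] == labels[v] else mismatch
            diagonal = max(dp[i - 1][u] for u in prev_cols) + sub
            skip_vertex = max(dp[i][u] for u in prev_cols) + gap
            skip_residue = dp[i - 1][v + 1] + gap
            dp[i][v + 1] = max(diagonal, skip_vertex, skip_residue)
    best_col = max(range(n + 1), key=lambda c: dp[m][c])
    return dp[m][best_col], (best_col - 1 if best_col > 0 else None)


# Toy graph: M -> E -> (P | T) -> Q
labels = ["M", "E", "P", "T", "Q"]
preds = [[], [0], [1], [1], [2, 3]]
print(align_to_dag("METQ", labels, preds))
print(align_to_dag("MEQ", labels, preds))
```

The only difference from sequence-to-sequence DP is that the "previous column" is a set of columns, one per incoming edge, which is what the arrows between columns in the figure express.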
Figure 3.
The general PanPA pipeline and its subcommands (in blue). Each subcommand can also be run separately or more than once with different parameters.
Figure 4.
Effect of the different parameters on the fraction of wrongly aligned sequences, where a “wrong alignment” is a sequence aligned to a different graph than the one it originated from. Each point is colored by the seed hits limit (how many hits each seed can point to), and shapes correspond to the aligned hits limit (how many graphs one sequence can align to). For small k values, a high number of wrong alignments is produced unless the index size is limited. The align seed limit has a relatively small effect on the percentage of wrong alignments.
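To make the two limits in this caption concrete, here is a small sketch of a k-mer seed index over MSAs. How PanPA actually applies its limits is not stated here, so the behavior below is an assumption: a seed keeps at most `seed_limit` graph ids (later ones are dropped), and a query is passed on to at most `aligned_limit` graphs, ranked by seed votes. The names `build_index`, `candidate_graphs`, and the toy sequences are all hypothetical.

```python
from collections import Counter, defaultdict


def build_index(msas, k, seed_limit=None):
    """Map each k-mer to the ids of the MSAs (graphs) that contain it.

    Assumption: once a k-mer already points to `seed_limit` graphs,
    further graphs are simply not recorded for it.
    """
    index = defaultdict(list)
    for msa_id, rows in msas.items():
        kmers = set()
        for row in rows:
            seq = row.replace("-", "")  # seeds come from ungapped sequences
            for i in range(len(seq) - k + 1):
                kmers.add(seq[i:i + k])
        for kmer in kmers:
            hits = index[kmer]
            if seed_limit is None or len(hits) < seed_limit:
                hits.append(msa_id)
    return index


def candidate_graphs(query, index, k, aligned_limit):
    """Return up to `aligned_limit` graph ids, ranked by shared k-mers."""
    votes = Counter()
    for i in range(len(query) - k + 1):
        for msa_id in index.get(query[i:i + k], ()):
            votes[msa_id] += 1
    return [msa_id for msa_id, _ in votes.most_common(aligned_limit)]


# Made-up toy MSAs, for illustration only.
msas = {"gyrA": ["MSDLAREI", "MSDLA-EI"], "parC": ["MSDMAERL"]}
index = build_index(msas, k=3)
print(candidate_graphs("SDLAREI", index, k=3, aligned_limit=1))
print(candidate_graphs("MSDMAE", index, k=3, aligned_limit=2))
```

With a small k, many seeds are shared between unrelated graphs, so a query collects votes for the wrong graphs more often; capping the hits per seed removes the most promiscuous seeds, which matches the trend the caption describes.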
Figure 5.
Effect of the different parameters on the number of unaligned sequences when aligning 92 196 unseen E. coli sequences. For small k values, the majority of sequences were not aligned unless a limit for the index hits size is set (the red marks); if the index hits size is not limited, over 99% of sequences produce an alignment.
Figure 6.
Upset plot of the unique alignments of 4 839 981 sequences from the coding regions of 1074 S. enterica assemblies from RefSeq. Alignments with BWA and GraphAligner (DNA), and PanPA (amino acids) against their corresponding E. coli counterparts were constructed using the parameters in Supplementary Section S2.
Figure 7.
Distribution of identity scores for BWA, GraphAligner, and PanPA from aligning the S. enterica sequences. The peak for PanPA is shifted to the right, meaning higher sequence identity, as amino acid sequences align with higher identity than nucleotide sequences.
Figure 8.
Visualization of parts of the protein graphs for (a) GyrA and (b) ParC using Bandage (Wick et al. 2015). Nodes are colored according to the number of resistant/susceptible strains that pass through them, with blue representing resistance and red representing susceptibility; the color intensity corresponds to the number of strains. Additional colored lines show the paths of the aligned sequences from the 10% that were set aside (45 resistant and 117 susceptible sequences), with the color representing the type and the thickness representing the number of sequences taking that path. A thick blue line of resistant sequences follows the blue path through the blue nodes and, conversely, a thick red line of susceptible sequences follows the red path through the red nodes.

References

    1. Akutsu T. A Linear Time Pattern Matching Algorithm Between a String and a Tree, Combinatorial Pattern Matching, Lecture Notes in Computer Science, Vol. 684, Springer-Verlag, Berlin/Heidelberg, 1993, 1–10.
    2. Amann RI, Ludwig W, Schleifer KH. et al. Phylogenetic identification and in situ detection of individual microbial cells without cultivation. Microbiol Rev 1995;59:143–69.
    3. Amir A, Lewenstein M, Lewenstein N. et al. Pattern matching in hypertext. J Algorithms 2000;35:82–99.
    4. Bagel S, Hüllen V, Wiedemann B. et al. Impact of gyrA and parC mutations on quinolone resistance, doubling time, and supercoiling degree of Escherichia coli. Antimicrob Agents Chemother 1999;43:868–75.
    5. Bininda-Emonds ORP. TransAlign: using amino acids to facilitate the multiple alignment of protein-coding DNA sequences. BMC Bioinformatics 2005;6:156.