Bioinformatics. 2025 May 6;41(5):btaf219.
doi: 10.1093/bioinformatics/btaf219.

Pitfalls of bacterial pan-genome analysis approaches: a case study of Mycobacterium tuberculosis and two less clonal bacterial species

Maximillian G Marin et al. Bioinformatics.

Abstract

Summary: Pan-genome analysis is a fundamental tool for studying bacterial genome evolution; however, the variety of methods used to define and measure the pan-genome poses challenges to the interpretation and reliability of results. Using Mycobacterium tuberculosis, a clonally evolving bacterium with a small accessory genome, as a model system, we systematically evaluated sources of variability in pan-genome estimates. Our analysis revealed that differences in assembly type (short-read versus hybrid), annotation pipeline, and pan-genome software significantly affect predictions of core and accessory genome size. Extending our analysis to two additional bacterial species, Escherichia coli and Staphylococcus aureus, we observed consistent tool-dependent biases but species-specific patterns in pan-genome variability. Our findings highlight the importance of integrating nucleotide- and protein-level analyses to improve the reliability and reproducibility of pan-genome studies across diverse bacterial populations.

Availability and implementation: Panqc is freely available under an MIT license at https://github.com/maxgmarin/panqc.
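
To make the core/accessory distinction in the summary concrete, the following is a minimal sketch of how core and accessory gene counts could be derived from a gene presence/absence matrix. The 99% core threshold, the toy matrix, and the function name `split_core_accessory` are illustrative assumptions; they are not values or code taken from the study.

```python
from typing import Dict, List, Tuple


def split_core_accessory(
    presence: Dict[str, List[bool]],
    core_threshold: float = 0.99,  # assumed cutoff; tools and studies differ
) -> Tuple[List[str], List[str]]:
    """Split gene clusters into core and accessory sets.

    A cluster counts as core when the fraction of genomes carrying it
    is at least `core_threshold`; otherwise it is accessory.
    """
    core: List[str] = []
    accessory: List[str] = []
    for gene, calls in presence.items():
        fraction_present = sum(calls) / len(calls)
        if fraction_present >= core_threshold:
            core.append(gene)
        else:
            accessory.append(gene)
    return core, accessory


# Toy matrix: 3 hypothetical gene clusters across 4 genomes.
toy = {
    "geneA": [True, True, True, True],
    "geneB": [True, False, True, True],
    "geneC": [False, False, True, False],
}
core, accessory = split_core_accessory(toy)
print(len(core), len(accessory))  # prints: 1 2
```

Because a single missed annotation in one genome is enough to move a cluster from core to accessory under a strict threshold, counts like these are sensitive to the assembly, annotation, and clustering choices the summary describes.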


Figures

Figure 1.
Summary of dataset of 151 complete Mtb genomes. Left: Maximum likelihood phylogeny of all 151 genomes, colored according to their lineage (L1-6, L8). Right: Heatmap of pairwise ANI. Below: Distribution of pairwise ANI values, and corresponding heatmap colorbar.
Figure 2.
Characteristics of Mtb SV pan-genome graph. (A) Left: Circle representing the high-level view of the Mtb SV pan-genome graph. Right: Two bubble regions shown in detail. Bubble region 20 is representative of regions with a simple insertion/deletion, containing a single SV node (186 bp) in gene pe4 (Rv0160c). Bubble region 309 is representative of a complex bubble region, containing 88 SV nodes (55 759 bp) spanning from gene plcC (Rv2349c) to ppe40 (Rv2356c). (B) Distribution of the number of SV nodes per bubble region. (C) Distribution of SV node length. (D) Hierarchical breakdown of Core and SV nodes in specific categories of interest, showing number of nodes and cumulative length.
Figure 3.
Comparison of Mtb pan-genome predictions across different analysis parameters. (A) Comparison of the number of core and accessory genes estimated for the identical population of 151 Mtb isolates across all tested parameters: Assembly type (hybrid versus short-read), annotation pipeline (Bakta versus PGAP), and pan-genome software (Panaroo, Roary, PPanGGolin, and Pangene). Each data point represents a different set of gene clustering parameters of the specific software. (B) Number of predicted CDS features annotated by Bakta and PGAP across all hybrid Mtb genomes. (C) Number of predicted pseudogene features annotated by Bakta and PGAP across all hybrid Mtb genomes.
Figure 4.
Pan-genome tool comparison across three different bacterial species. (A–C) Core and accessory genome estimates for Mtb, Eco, and Sau datasets across all tested parameters: Assembly type (hybrid versus short-read), and pan-genome software (Panaroo, Roary, PPanGGolin, and Pangene). Each data point represents a different set of gene clustering parameters of the specific software. (D) Percentage of gene absences due to CDS annotation discrepancy across Mtb, Eco, and Sau. Each data point represents a different set of gene clustering parameters for Panaroo or Roary.
Figure 5.
Overview of the panqc nucleotide correction pipeline and panqc adjustment of Mtb and Eco pan-genome estimates. (A) Diagram of the panqc algorithm: In Step 1, all predicted gene absences making up the predicted accessory genome are identified. In Step 2, each absent gene’s nucleotide sequence is aligned against all genomes. In Step 3, alignments are analyzed to identify whether the nucleotide sequence is still present despite the previously predicted absence. In Step 4, all genes are clustered based on the similarity of their nucleotide sequences. In Step 5, pan-genome estimates are readjusted, accounting for presence/absence of the nucleotide sequence. (B) Comparison of Panaroo and Roary pan-genome predictions before and after panqc re-adjustment with default parameters for Mtb and Eco datasets, for both hybrid and short-read assemblies. Each data point represents a different set of gene clustering parameters for Panaroo or Roary before or after panqc adjustment.
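
The idea behind Steps 1, 2, 3, and 5 of the Figure 5 caption can be illustrated with a toy sketch. This is not the panqc implementation: exact substring search stands in for nucleotide alignment, Step 4 (clustering genes by nucleotide similarity) is omitted, and the function names, sequences, and genome labels are all hypothetical.

```python
from typing import Dict

PresenceMatrix = Dict[str, Dict[str, bool]]  # gene -> genome -> present?


def recheck_absences(
    presence: PresenceMatrix,
    gene_seqs: Dict[str, str],  # gene -> representative nucleotide sequence
    genomes: Dict[str, str],    # genome -> assembled nucleotide sequence
) -> PresenceMatrix:
    """Return a copy of `presence` with sequence-supported absences flipped."""
    adjusted = {gene: dict(calls) for gene, calls in presence.items()}
    for gene, calls in presence.items():
        for genome, present in calls.items():
            if present:
                continue  # Step 1: only predicted absences are re-examined
            # Steps 2-3 (simplified): exact substring search instead of
            # alignment. A real tool would tolerate mismatches and indels.
            if gene_seqs[gene] in genomes[genome]:
                adjusted[gene][genome] = True
    return adjusted


def count_accessory(presence: PresenceMatrix) -> int:
    """Step 5 (simplified): genes missing from at least one genome."""
    return sum(1 for calls in presence.values() if not all(calls.values()))


# Toy data: geneX was missed by annotation in g2; geneY is truly absent in g3.
genomes = {
    "g1": "ATGAAACCCGGGTTGATTACATAA",
    "g2": "CCAAACCCGGGTTGATTACAGG",
    "g3": "ATGAAACCCGGGTTTTTTTTAA",
}
gene_seqs = {"geneX": "AAACCCGGG", "geneY": "GATTACA"}
presence = {
    "geneX": {"g1": True, "g2": False, "g3": True},
    "geneY": {"g1": True, "g2": True, "g3": False},
}

adjusted = recheck_absences(presence, gene_seqs, genomes)
print(count_accessory(presence), count_accessory(adjusted))  # prints: 2 1
```

In this toy case the accessory count drops from 2 to 1 because the sequence of geneX is found in g2 despite the annotated absence, while geneY remains accessory because its sequence is genuinely missing from g3.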
