Review
Brief Bioinform. 2024 Sep 23;25(6):bbae588.
doi: 10.1093/bib/bbae588.

A gentle introduction to pangenomics

Chelsea A Matthews et al. Brief Bioinform. 2024.

Abstract

Pangenomes have emerged in response to limitations associated with traditional linear reference genomes. In contrast to a traditional reference that is (usually) assembled from a single individual, pangenomes aim to represent all of the genomic variation found in a group of organisms. The term 'pangenome' is currently used to describe multiple different types of genomic information, and limited language is available to differentiate between them. This is frustrating for researchers working in the field and confusing for researchers new to the field. Here, we provide an introduction to pangenomics relevant to both prokaryotic and eukaryotic organisms and propose a formalization of the language used to describe pangenomes (see the Glossary) to improve the specificity of discussion in the field.

Keywords: genomic variation; pangenome; presence–absence variation (PAV); reference bias.

Figures

Figure 1
Comparison of a traditional reference approach with a pangenomic approach. In the traditional reference approach, reads from some parts of the sample genome are not similar enough to align to the reference, so these regions are excluded from the comparison (the blue sequence on the left). In other areas, reads align poorly (as in the orange region on the left), so these regions are only partially or poorly represented. On the right-hand side, a pangenomic approach results in a higher proportion of reads aligning [5–7], so a larger portion of the sample genome can be analysed.
Figure 2
Three types of pangenomic data structures. (A) Consider a collection of four genomic sequences—one reference genome and three other genomes from the same population. Coloured sections indicate regions in genomes a, b, and c that diverge from the reference. Genes are indicated by black shapes. (B) PAV pangenome. Genes found within a population are partitioned into two groups: the core genome, which includes genes present in all members of the population, and the accessory genome, which includes genes present in only some members of the population. (C) A representative sequence pangenome. A set of genomic sequences such that the bulk of sequence diversity from the population is represented without significant duplication. (D) A sequence-oriented pangenome graph. A graph structure composed of nodes (genomic sequence) and edges (arrows between the sequence). Specific paths through the graph correspond to haplotypes present in the population. Pangenome graphs may also be gene-oriented, in which case each node represents a gene and edges indicate gene adjacency in the input genomes (see Fig. 3 for more details).
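The core/accessory partition described in panel (B) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' method: it assumes gene presence has already been determined for each genome, and the function name `partition_pav` and the toy gene labels are hypothetical.

```python
def partition_pav(gene_presence):
    """Split genes into core (present in every genome) and accessory (present in only some).

    gene_presence maps a genome name to the set of gene identifiers found in it.
    """
    if not gene_presence:
        return set(), set()
    gene_sets = [set(genes) for genes in gene_presence.values()]
    all_genes = set().union(*gene_sets)      # every gene seen in the population
    core = set.intersection(*gene_sets)      # genes shared by all genomes
    accessory = all_genes - core             # genes missing from at least one genome
    return core, accessory


# Hypothetical toy population: one reference and three other genomes.
toy_presence = {
    "reference": {"g1", "g2", "g3", "g4"},
    "a": {"g1", "g2", "g3", "g5"},
    "b": {"g1", "g2", "g4"},
    "c": {"g1", "g2", "g3", "g6"},
}

core_genes, accessory_genes = partition_pav(toy_presence)
print(sorted(core_genes))
print(sorted(accessory_genes))
```

In this toy input only `g1` and `g2` occur in all four genomes, so they form the core genome and the remaining genes are accessory.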
Figure 3
Gene-oriented graphs. Genomes are assembled and annotated, and the amino acid sequence or the nucleotide sequence of all genes is extracted and clustered. Each cluster makes up a single node of the graph, and the lines between the nodes connect genes that are adjacent in the input genomes. The thicker the line, the larger the number of genomes that have these two genes adjacent. If we consider Genome 1 to be the reference genome in this example, then Genome 2 has genes L and M (in yellow) instead of the reference H and I (in red) while Genome 3 has an insertion of the genes N, O, and P (in green) between genes B and C.
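The edge weighting described above (thicker lines for adjacencies shared by more genomes) can be sketched as follows. This is a simplified illustration that assumes annotation and clustering are already done, so each genome is just an ordered list of cluster labels. The function name `gene_adjacency_graph` and the gene orders are hypothetical and only loosely follow the figure's example.

```python
from collections import Counter


def gene_adjacency_graph(genomes):
    """Count, for each pair of adjacent gene clusters, how many genomes contain that adjacency.

    genomes maps a genome name to its ordered list of gene-cluster labels.
    Edges are undirected, so each pair is stored as a sorted tuple.
    """
    edge_weights = Counter()
    for gene_order in genomes.values():
        # A set ensures each adjacency is counted at most once per genome.
        edges = {tuple(sorted(pair)) for pair in zip(gene_order, gene_order[1:])}
        edge_weights.update(edges)
    return edge_weights


# Hypothetical gene orders: Genome 2 swaps H, I for L, M;
# Genome 3 inserts N, O, P between B and C.
toy_genomes = {
    "Genome 1": ["A", "B", "C", "H", "I", "D"],
    "Genome 2": ["A", "B", "C", "L", "M", "D"],
    "Genome 3": ["A", "B", "N", "O", "P", "C", "H", "I", "D"],
}

weights = gene_adjacency_graph(toy_genomes)
for edge, count in sorted(weights.items()):
    print(edge, count)
```

The adjacency A–B appears in all three toy genomes and so gets the heaviest edge, while edges touching the inserted or substituted genes have weight 1.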
Figure 4
Four methods for identifying NRR sequences. Adapted from [16] (https://creativecommons.org/licenses/by/4.0/). (A) Reads from all samples that do not align to the selected reference genome are pooled and de novo assembled into NRR sequences. (B) For each sample, reads that do not align to the reference genome are de novo assembled into contigs. All contigs are pooled and then clustered to remove redundant sequences. (C) Reads that do not align to the reference genome are de novo assembled into contigs, and the reference genome is updated to include these contigs. This process is repeated iteratively for all samples, with the reference genome growing incrementally. (D) All reads for each sample are de novo assembled into contigs. Contigs are aligned to the reference genome, and unaligned contigs from all samples are pooled. Clustering is then used to remove redundant sequences.
Figure 5
Three methods for constructing sequence-oriented pangenome graphs. (A) Variants are added to the graph as bubbles ordered along the reference sequence. (B) Multiple genomic sequences are aligned to each other by introducing gaps (spaces) into their sequences so as to maximize the number of bases that match at each position. (C) A de Bruijn graph is constructed by breaking all genomic sequences into k-mers, creating a node for every k-mer that appears at least once, and connecting nodes that overlap by k-1 bases.
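The de Bruijn construction in panel (C) can be illustrated with a short Python sketch. It makes one simplifying assumption that is not stated in the caption: edges are drawn only between k-mers that are consecutive in at least one input sequence (such k-mers overlap by k-1 bases by construction). The function name `de_bruijn_graph` and the toy sequences are hypothetical.

```python
def de_bruijn_graph(sequences, k):
    """Build a k-mer de Bruijn graph as a dict mapping each k-mer to its successor k-mers."""
    graph = {}
    for seq in sequences:
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        for kmer in kmers:
            graph.setdefault(kmer, set())    # one node per distinct k-mer
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)           # consecutive k-mers share k-1 bases
    return graph


# Two toy sequences that differ at a single base.
toy_sequences = ["ACGTA", "ACGGA"]
graph = de_bruijn_graph(toy_sequences, k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

With k = 3 the two toy sequences share the node `ACG`, which then branches to `CGT` and `CGG`, showing how sequence differences appear as diverging paths in the graph.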
Figure 6
Open and closed pangenomes. As the number of genomes included in the pangenome increases, the total number of genes in the pangenome will either plateau (a closed pangenome) or will continue to increase so that the total number of genes for that species/population cannot be accurately estimated (an open pangenome).
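The growth curve behind this distinction can be sketched by tracking the cumulative number of distinct genes as genomes are added one at a time. This is a toy illustration under stated assumptions: the function name `pangenome_growth` and the gene sets are hypothetical, and a single fixed genome order is used, whereas real analyses often average over many orderings.

```python
def pangenome_growth(gene_sets):
    """Return the cumulative number of distinct genes after each genome is added, in order."""
    seen = set()
    sizes = []
    for genes in gene_sets:
        seen |= set(genes)
        sizes.append(len(seen))
    return sizes


# Hypothetical populations: the first stops gaining genes, the second keeps gaining them.
closed_like = [{"A", "B", "C"}, {"A", "B", "C", "D"}, {"A", "B", "D"}, {"A", "C", "D"}]
open_like = [{"A", "B"}, {"A", "C"}, {"A", "D"}, {"A", "E"}]

closed_curve = pangenome_growth(closed_like)
open_curve = pangenome_growth(open_like)
print(closed_curve)
print(open_curve)
```

The first curve flattens after the second genome, as expected for a closed pangenome, while the second gains a new gene with every added genome, as expected for an open one.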

References

    1. Saxena RK, Edwards D, Varshney RK. Structural variations in plant genomes. Brief Funct Genomics 2014;13:296–307. 10.1093/bfgp/elu016.
    2. Ballouz S, Dobin A, Gillis JA. Is it time to change the reference genome? Genome Biol 2019;20:159. 10.1186/s13059-019-1774-4.
    3. Gage JL, Vaillancourt B, Hamilton JP, et al. Multiple maize reference genomes impact the identification of variants by genome-wide association study in a diverse inbred panel. Plant Genome 2019;12:180069. 10.3835/plantgenome2018.09.0069.
    4. Huang L, Popic V, Batzoglou S. Short read alignment with populations of genomes. Bioinformatics 2013;29:i361–70. 10.1093/bioinformatics/btt215.
    5. Hickey G, Monlong J, Ebler J, et al. Pangenome graph construction from genome alignments with Minigraph-cactus. Nat Biotechnol 2024;42:663–73. 10.1038/s41587-023-01793-w.