A gentle introduction to pangenomics

Chelsea A Matthews¹, Nathan S Watson-Haigh²,³,⁴, Rachel A Burton¹, Anna E Sheppard⁵

Affiliations

¹ School of Agriculture, Food and Wine, Waite Campus, University of Adelaide, Urrbrae, South Australia 5064, Australia.
² Australian Genome Research Facility, Victorian Comprehensive Cancer Centre, Melbourne, Victoria 3000, Australia.
³ South Australian Genomics Centre, SAHMRI, North Terrace, Adelaide, South Australia 5000, Australia.
⁴ Alkahest Inc., San Carlos, CA 94070, United States.
⁵ School of Biological Sciences, University of Adelaide, Adelaide, South Australia 5005, Australia.

PMID: 39552065
PMCID: PMC11570541
DOI: 10.1093/bib/bbae588

Review

Chelsea A Matthews et al. Brief Bioinform. 2024 Sep 23;25(6):bbae588.

Abstract

Pangenomes have emerged in response to limitations associated with traditional linear reference genomes. In contrast to a traditional reference that is (usually) assembled from a single individual, pangenomes aim to represent all of the genomic variation found in a group of organisms. The term 'pangenome' is currently used to describe multiple different types of genomic information, and limited language is available to differentiate between them. This is frustrating for researchers working in the field and confusing for researchers new to the field. Here, we provide an introduction to pangenomics relevant to both prokaryotic and eukaryotic organisms and propose a formalization of the language used to describe pangenomes (see the Glossary) to improve the specificity of discussion in the field.

Keywords: genomic variation; pangenome; presence–absence variation (PAV); reference bias.

Figures

**Figure 1**
Comparison of a traditional reference approach with a pangenomic approach. In the traditional reference approach, reads from some parts of the sample genome are not similar enough to align to the reference, and so, these regions are excluded from the comparison (the blue sequence on the left). In other areas, reads will align poorly (as in the case of the orange region on the left) and so are partially or poorly represented. On the right-hand side, we can see how a pangenomic approach results in a higher proportion of reads aligning [5–7], and so, a larger portion of the sample genome is able to be analysed.

**Figure 2**
Three types of pangenomic data structures. (A) Consider a collection of four genomic sequences—one reference genome and three other genomes from the same population. Coloured sections indicate regions in genomes a, b, and c that diverge from the reference. Genes are indicated by black shapes. (B) PAV pangenome. Genes found within a population are partitioned into two groups: the core genome, which includes genes present in all members of the population, and the accessory genome, which includes genes present in only some members of the population. (C) A representative sequence pangenome. A set of genomic sequences such that the bulk of sequence diversity from the population is represented without significant duplication. (D) A sequence-oriented pangenome graph. A graph structure composed of nodes (genomic sequence) and edges (arrows between the sequence). Specific paths through the graph correspond to haplotypes present in the population. Pangenome graphs may also be gene-oriented, in which case each node represents a gene and edges indicate gene adjacency in the input genomes (see Fig. 3 for more details).
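
The core/accessory partition described in panel (B) can be sketched with plain set operations. This is a minimal illustration, not the method of any particular tool: the function name `partition_pav`, the genome labels, and the gene names are all hypothetical, and it assumes gene presence has already been called for every genome.

```python
def partition_pav(gene_sets):
    """Split genes into core (present in every genome) and accessory (present in only some)."""
    genomes = list(gene_sets.values())
    all_genes = set().union(*genomes)      # every gene seen in the population
    core = set.intersection(*genomes)      # genes shared by all genomes
    accessory = all_genes - core           # everything else
    return core, accessory


# Hypothetical gene content for one reference and three other genomes.
gene_sets = {
    "reference": {"A", "B", "C", "D"},
    "a": {"A", "B", "C", "E"},
    "b": {"A", "B", "D"},
    "c": {"A", "B", "C", "D", "F"},
}

core, accessory = partition_pav(gene_sets)
print(sorted(core))       # ['A', 'B']
print(sorted(accessory))  # ['C', 'D', 'E', 'F']
```

The sketch expects at least one genome; an empty input would need separate handling.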

**Figure 3**
Gene-oriented graphs. Genomes are assembled and annotated, and the amino acid sequence or the nucleotide sequence of all genes is extracted and clustered. Each cluster makes up a single node of the graph, and the lines between the nodes connect genes that are adjacent in the input genomes. The thicker the line, the larger the number of genomes that have these two genes adjacent. If we consider Genome 1 to be the reference genome in this example, then Genome 2 has genes L and M (in yellow) instead of the reference H and I (in red) while Genome 3 has an insertion of the genes N, O, and P (in green) between genes B and C.
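
A gene-oriented graph of this kind can be sketched as a table of adjacency counts, where each count plays the role of line thickness. The gene orders below are hypothetical and only loosely follow the example in the caption; the function name `gene_adjacency_graph` is an assumption, and the sketch presumes that genes have already been clustered so that each letter stands for one cluster.

```python
from collections import Counter


def gene_adjacency_graph(genomes):
    """Count, for each pair of adjacent gene clusters, how many genomes contain that adjacency."""
    edges = Counter()
    for order in genomes.values():
        # A set counts each adjacency once per genome; frozenset ignores orientation.
        seen = {frozenset(pair) for pair in zip(order, order[1:])}
        edges.update(seen)
    return edges


# Hypothetical gene orders: Genome 2 swaps H, I for L, M;
# Genome 3 inserts N, O, P between B and C.
genomes = {
    "Genome 1": ["A", "B", "C", "H", "I", "J"],
    "Genome 2": ["A", "B", "C", "L", "M", "J"],
    "Genome 3": ["A", "B", "N", "O", "P", "C", "H", "I", "J"],
}

edges = gene_adjacency_graph(genomes)
print(edges[frozenset({"A", "B"})])  # 3
print(edges[frozenset({"B", "C"})])  # 2
print(edges[frozenset({"B", "N"})])  # 1
```

Treating adjacency as unordered is a simplification chosen for brevity; a fuller model would also track strand and contig boundaries.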

**Figure 4**
Four methods for identifying NRR sequences. Adapted from [16] (https://creativecommons.org/licenses/by/4.0/). (A) Reads from all samples that don’t align to the selected reference genome are pooled and *de novo* assembled into NRR sequences. (B) For each sample, reads that don’t align to the reference genome are *de novo* assembled into contigs. All contigs are pooled and then clustered to remove redundant sequences. (C) Reads that don’t align to the reference genome are *de novo* assembled into contigs, and the reference genome is updated to include these contigs. This process is repeated iteratively for all samples with the reference genome growing incrementally. (D) All reads for each sample are *de novo* assembled into contigs. Contigs are aligned to the reference genome and all unaligned contigs are pooled for all samples. Clustering is then used to remove redundant sequences.

**Figure 5**
Three methods for constructing sequence-oriented pangenome graphs. (A) Variants are added to the graph as bubbles ordered along the reference sequence. (B) Multiple genomic sequences are aligned to each other by introducing spaces into their sequences so as to maximize the number of bases that match up at each location. (C) A de Bruijn graph is constructed by breaking all genomic sequences up into k-mers, creating nodes from all k-mers that appear at least once, and connecting nodes that overlap each other by k-1.
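
The de Bruijn construction in panel (C) can be sketched in a few lines. This follows the caption literally: every k-mer that appears at least once becomes a node, and a node is connected to any node whose first k-1 bases match its last k-1 bases. The function name `de_bruijn_graph` and the toy sequences are assumptions for illustration, and the sketch ignores reverse complements, which practical tools usually handle.

```python
from collections import defaultdict


def de_bruijn_graph(sequences, k):
    """Build a node-centric de Bruijn graph: nodes are k-mers, edges join k-mers overlapping by k-1."""
    nodes = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            nodes.add(seq[i:i + k])

    # Index nodes by their first k-1 bases so overlaps can be looked up directly.
    by_prefix = defaultdict(set)
    for node in nodes:
        by_prefix[node[:k - 1]].add(node)

    # Successors of a node are the nodes whose prefix equals its (k-1)-base suffix.
    return {node: sorted(by_prefix.get(node[1:], ())) for node in sorted(nodes)}


# Two toy sequences that differ at a single base (T versus C).
graph = de_bruijn_graph(["ACGTA", "ACGCA"], k=3)
for node, successors in graph.items():
    print(node, "->", successors)
```

In this toy input the node `ACG` has two successors, `CGC` and `CGT`, which is where the two sequences diverge.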

**Figure 6**
Open and closed pangenomes. As the number of genomes included in the pangenome increases, the total number of genes in the pangenome will either plateau (a closed pangenome) or will continue to increase so that the total number of genes for that species/population cannot be accurately estimated (an open pangenome).
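
The open/closed distinction can be illustrated by tracking the cumulative gene count as genomes are added one at a time. The sketch below uses two hypothetical, hand-made gene sets and a hypothetical function name, `accumulation_curve`; a real analysis would typically average the curve over many genome orderings rather than rely on a single order.

```python
def accumulation_curve(gene_sets):
    """Return the cumulative number of distinct genes as genomes are added in order."""
    seen = set()
    totals = []
    for genes in gene_sets:
        seen |= genes              # add any genes not seen before
        totals.append(len(seen))
    return totals


# Hypothetical populations: one that stops yielding new genes, one that does not.
closed_like = [{"A", "B", "C"}, {"A", "B", "D"}, {"A", "C", "D"}, {"B", "C", "D"}]
open_like = [{"A", "B"}, {"A", "C"}, {"A", "D"}, {"A", "E"}]

print(accumulation_curve(closed_like))  # [3, 4, 4, 4]
print(accumulation_curve(open_like))    # [2, 3, 4, 5]
```

The first curve plateaus, as expected for a closed pangenome, while the second keeps growing with every added genome.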

References

1. Saxena RK, Edwards D, Varshney RK. Structural variations in plant genomes. Brief Funct Genomics 2014;13:296–307. doi: 10.1093/bfgp/elu016.
2. Ballouz S, Dobin A, Gillis JA. Is it time to change the reference genome? Genome Biol 2019;20:159. doi: 10.1186/s13059-019-1774-4.
3. Gage JL, Vaillancourt B, Hamilton JP, et al. Multiple maize reference genomes impact the identification of variants by genome-wide association study in a diverse inbred panel. Plant Genome 2019;12:180069. doi: 10.3835/plantgenome2018.09.0069.
4. Huang L, Popic V, Batzoglou S. Short read alignment with populations of genomes. Bioinformatics 2013;29:i361–70. doi: 10.1093/bioinformatics/btt215.
5. Hickey G, Monlong J, Ebler J, et al. Pangenome graph construction from genome alignments with Minigraph-cactus. Nat Biotechnol 2024;42:663–73. doi: 10.1038/s41587-023-01793-w.
