Genome Biol. 2023 Oct 5;24(1):223.

doi: 10.1186/s13059-023-03071-z.

GET_PANGENES: calling pangenes from plant genome alignments confirms presence-absence variation

Bruno Contreras-Moreira¹ ², Shradha Saraf³, Guy Naamati³, Ana M Casas⁴, Sandeep S Amberkar⁵, Paul Flicek³, Andrew R Jones⁵, Sarah Dyer⁶

Affiliations

¹ European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK. bcontreras@eead.csic.es.
² Estación Experimental Aula Dei-CSIC, 50059, Zaragoza, Spain. bcontreras@eead.csic.es.
³ European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK.
⁴ Estación Experimental Aula Dei-CSIC, 50059, Zaragoza, Spain.
⁵ Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool, UK.
⁶ European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK. sdyer@ebi.ac.uk.

PMID: 37798615
PMCID: PMC10552430
DOI: 10.1186/s13059-023-03071-z

Bruno Contreras-Moreira et al. Genome Biol. 2023.

Abstract

Crop pangenomes made from individual cultivar assemblies promise easy access to conserved genes, but genome content variability and inconsistent identifiers hamper their exploration. To address this, we define pangenes, which summarize a species' coding potential and link back to original annotations. The protocol get_pangenes performs whole genome alignments (WGA) to call syntenic gene models based on coordinate overlaps. A benchmark with small and large plant genomes shows that pangenes recapitulate phylogeny-based orthologies and produce complete soft-core gene sets. Moreover, WGAs support lift-over and help confirm gene presence-absence variation. Source code and documentation: https://github.com/Ensembl/plant-scripts

Keywords: Collinearity; Gene annotation; Pangene; Plant genome; Presence-absence variation; Whole genome alignment.
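
The pan-, soft-core and core gene sets mentioned in the abstract can be illustrated with a minimal Python sketch. It is not code from get_pangenes: the function name `occupancy_sets`, the input layout (cluster name mapped to the set of genomes where it occurs) and the 95% soft-core threshold are assumptions made here for illustration, since the abstract does not state a threshold.

```python
def occupancy_sets(presence, genomes, soft_core_percent=95):
    """Split pangene clusters into pan, soft-core and core sets.

    presence: dict mapping cluster name -> set of genomes containing it
              (hypothetical layout, not the get_pangenes matrix format).
    genomes:  list of all genomes in the analysis.
    soft_core_percent: assumed occupancy threshold for the soft-core.
    """
    n = len(genomes)
    # Integer ceiling of (percent * n / 100), avoiding floating point.
    min_soft = -(-soft_core_percent * n // 100)
    wanted = set(genomes)
    sets = {"pan": set(), "soft_core": set(), "core": set()}
    for cluster, found_in in presence.items():
        k = len(set(found_in) & wanted)  # genomes where the cluster occurs
        if k >= 1:
            sets["pan"].add(cluster)
        if k >= min_soft:
            sets["soft_core"].add(cluster)
        if k == n:
            sets["core"].add(cluster)
    return sets


# Toy example with four hypothetical genomes and three clusters.
genomes = ["g1", "g2", "g3", "g4"]
presence = {
    "c1": {"g1", "g2", "g3", "g4"},
    "c2": {"g1", "g2", "g3"},
    "c3": {"g1"},
}
result = occupancy_sets(presence, genomes, soft_core_percent=75)
```

In this sketch the three sets are nested: every core cluster is also soft-core, and every soft-core cluster belongs to the pan set.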

Conflict of interest statement

Paul Flicek is a member of the Scientific Advisory Boards of Fabric Genomics, Inc. and Eagle Genomics, Ltd.

Figures

**Fig. 1**
Features of get_pangenes.pl. A Flowchart of the main tasks and deliverables of script *get_pangenes.pl*: cutting cDNA and CDS sequences (top), calling collinear genes (middle, panels B and C) and clustering (bottom, panel D). By default, only cDNA and CDS sequences longer than 100 bp are considered. Whole genome alignments (WGA) can be computed with minimap2 (default) or GSAlign, and the input genomes can optionally be split into chromosomes or have their long geneless regions (> 1 Mbp) masked. Resulting gene clusters contain all isoforms and are post-processed to produce pangene and percentage of conserved sequences (POCS) matrices, as well as to estimate pan-, soft-core-, and core-genomes. GSAlign also produces average nucleotide identity (ANI) matrices. Several tasks can be fine-tuned by customizing an array of parameters, of which alignment coverage is perhaps the most important. B WGA of genomes A and B produces BED-like files that are intersected with gene models from B. Intersected coordinates are then used to transform B gene models to the genomic space of A. Finally, overlapping A gene models on the same strand are defined as collinear genes. C Feature overlap is computed from WGAs and gene coordinates from source GFF files. When checking the overlap of A and B gene models, strandedness is required. Overlaps can also be estimated between gene models annotated in one assembly and matched genomic segments from others. D Making greedy clusters by merging pairs of collinear genes. This algorithm has a key parameter, the maximum distance (in genes) among sequences of the same species that go in a cluster (default = 5). Its effect is illustrated on the right side, where gene g34 is left unclustered for having too many intervening genes
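
The stranded overlap of panels B and C and the greedy clustering of panel D can be sketched in a few lines of Python. This is a simplified illustration, not the get_pangenes.pl implementation: the function names, the tuple layouts and the use of a single gene index per species (as if every gene sat on one chromosome) are assumptions made here.

```python
def overlap_bp(a, b):
    """Overlap in bp of two gene models given as (chrom, start, end, strand).

    Coordinates are assumed 1-based, inclusive and already in the same
    genomic space. Models on different sequences or strands score 0.
    """
    if a[0] != b[0] or a[3] != b[3]:
        return 0
    return max(0, min(a[2], b[2]) - max(a[1], b[1]) + 1)


def greedy_clusters(genes, pairs, max_dist=5):
    """Greedily merge pairs of collinear genes into clusters.

    genes: dict gene id -> (species, ordinal position of the gene).
    pairs: iterable of (gene id, gene id) collinear pairs, in merge order.
    A merge is skipped if it would put two genes of the same species
    more than max_dist genes apart in one cluster.
    """
    cluster_of = {g: {g} for g in genes}  # every gene starts alone
    for a, b in pairs:
        ca, cb = cluster_of[a], cluster_of[b]
        if ca is cb:
            continue  # already in the same cluster
        too_far = any(
            genes[x][0] == genes[y][0]
            and abs(genes[x][1] - genes[y][1]) > max_dist
            for x in ca for y in cb
        )
        if too_far:
            continue
        ca |= cb  # merge cb into ca in place
        for g in cb:
            cluster_of[g] = ca
    seen, clusters = set(), []
    for g in genes:  # report each cluster once, in input order
        c = cluster_of[g]
        if id(c) not in seen:
            seen.add(id(c))
            clusters.append(sorted(c))
    return clusters


# Toy data: gene b9 is 8 genes away from b1 in species B.
genes = {"a1": ("A", 1), "b1": ("B", 1), "b9": ("B", 9)}
clusters = greedy_clusters(genes, [("a1", "b1"), ("a1", "b9")])
shared = overlap_bp(("chr1", 100, 500, "+"), ("chr1", 300, 800, "+"))
```

In the toy data, b9 stays in its own cluster because joining it would place two genes of species B more than five genes apart, which mirrors the unclustered gene described for panel D.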

**Fig. 2**
Aligned genomic region in chr1 of *Oryza nivara* (top) and *Oryza sativa* Japonica group cv. Nipponbare (bottom) as displayed in the Ensembl Plants browser. Genes on the forward strand (>) are above contigs, whilst those in the negative strand (<) are underneath. As a result of the genomic alignment, genes of *O. nivara* overlap with gene models from *O. sativa*. This evidence can be used to identify collinear genes that take equivalent positions in different genomes, as illustrated with gene models ONIVA01G00130 and Os01g0100500, which overlap over 2.4 kb (yellow rectangle). The example shows that overlapping gene models might share only some exons. The table below shows the collinear gene models identified based on minimap2 and GSAlign alignments, together with the corresponding overlapped base pairs

**Fig. 3**
A Multiple alignment of protein sequences encoded in barley pangene cluster Horvu_MOREX_1H01G011400, produced with Clustalx. This cluster contains isoforms from 13 gene models, but none from genotype OUN333. The last sequence is encoded by a CDS sequence lifted-over from cultivar HOR3081 on the genome of OUN3, spanning 3 exons (exon boundaries are marked with asterisks). B Patch GFF file with the coordinates of the exons lifted-over from gene model Horvu_3081_1H01G015200. The underlying CDS nucleotide sequence was aligned with 411 matches, no indels and no mismatches with check_evidence.pl -f

**Fig. 4**
Genomic context of pangene cluster HORVU.MOREX.r3.3HG0311160 (green arrows), which corresponds to barley locus HvOS2. The genome fragment on top corresponds to reference genome MorexV3 and the tracks below show collinear genes found in other barley assemblies and annotation sets. In this example, the BarkeBaRT2v18 gene is split in two partial models. Note that white gene models might not be collinear as they could be encoded in a different genome fragment. Figure generated with script check_evidence.pl and pyGenomeViz (https://github.com/moshi4/pyGenomeViz)
