Genome Biol. 2023 Oct 5;24(1):223.
doi: 10.1186/s13059-023-03071-z.

GET_PANGENES: calling pangenes from plant genome alignments confirms presence-absence variation

Bruno Contreras-Moreira et al. Genome Biol. 2023.

Abstract

Crop pangenomes made from individual cultivar assemblies promise easy access to conserved genes, but genome content variability and inconsistent identifiers hamper their exploration. To address this, we define pangenes, which summarize a species' coding potential and link back to original annotations. The protocol get_pangenes performs whole genome alignments (WGA) to call syntenic gene models based on coordinate overlaps. A benchmark with small and large plant genomes shows that pangenes recapitulate phylogeny-based orthologies and produce complete soft-core gene sets. Moreover, WGAs support lift-over and help confirm gene presence-absence variation. Source code and documentation: https://github.com/Ensembl/plant-scripts

Keywords: Collinearity; Gene annotation; Pangene; Plant genome; Presence-absence variation; Whole genome alignment.

Conflict of interest statement

Paul Flicek is a member of the Scientific Advisory Boards of Fabric Genomics, Inc. and Eagle Genomics, Ltd.

Figures

Fig. 1
Features of get_pangenes.pl. A Flowchart of the main tasks and deliverables of script get_pangenes.pl: cutting cDNA and CDS sequences (top), calling collinear genes (middle, panels B and C) and clustering (bottom, panel D). By default, only cDNA and CDS sequences longer than 100 bp are considered. Whole genome alignments (WGA) can be computed with minimap2 (default) or GSAlign, and the input genomes can optionally be split into chromosomes or have their long geneless regions (>1 Mbp) masked. Resulting gene clusters contain all isoforms and are post-processed to produce pangene and percentage of conserved sequences (POCS) matrices, as well as to estimate pan-, soft-core-, and core-genomes. GSAlign also produces average nucleotide identity (ANI) matrices. Several tasks can be fine-tuned by customizing an array of parameters, of which alignment coverage is perhaps the most important. B WGA of genomes A and B produces BED-like files that are intersected with gene models from B. Intersected coordinates are then used to transform B gene models to the genomic space of A. Finally, overlapping A gene models on the same strand are defined as collinear genes. C Feature overlap is computed from WGAs and gene coordinates from source GFF files. When checking the overlap of A and B gene models, strandedness is required. Overlaps can also be estimated between gene models annotated in one assembly and matched genomic segments from others. D Making greedy clusters by merging pairs of collinear genes. This algorithm has a key parameter, the maximum distance (in genes) among sequences of the same species that go in a cluster (default = 5). Its effect is illustrated on the right side, where gene g34 is left unclustered for having too many intervening genes.
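The greedy clustering described in panel D can be illustrated with a minimal Python sketch. It is not the code of get_pangenes.pl: the `Gene` record, the `greedy_clusters` function, the use of an ordinal gene index as the distance measure, and the rule that a gene failing the distance check simply stays unclustered are all assumptions made for illustration. Only the idea of merging collinear pairs and the default maximum distance of 5 genes come from the caption.

```python
from collections import namedtuple

# Hypothetical record: 'index' is the ordinal position of the gene in its genome.
Gene = namedtuple("Gene", ["gene_id", "species", "index"])


def greedy_clusters(genes, collinear_pairs, max_distance=5):
    """Greedily merge pairs of collinear genes into clusters, in input order.

    A gene joins a cluster only if every member from the same species lies
    within max_distance genes of it (assumed interpretation of panel D).
    """
    info = {g.gene_id: g for g in genes}
    cluster_of = {}  # gene_id -> cluster number
    clusters = {}    # cluster number -> list of gene_ids

    def fits(gene_id, members):
        g = info[gene_id]
        return all(
            abs(info[m].index - g.index) <= max_distance
            for m in members
            if info[m].species == g.species
        )

    for a, b in collinear_pairs:
        ca, cb = cluster_of.get(a), cluster_of.get(b)
        if ca is None and cb is None:
            # Neither gene is clustered yet: start a new cluster if compatible.
            if fits(b, [a]):
                cid = len(clusters)
                clusters[cid] = [a, b]
                cluster_of[a] = cluster_of[b] = cid
        elif cb is None:
            if fits(b, clusters[ca]):
                clusters[ca].append(b)
                cluster_of[b] = ca
        elif ca is None:
            if fits(a, clusters[cb]):
                clusters[cb].append(a)
                cluster_of[a] = cb
        # Both genes already clustered: keep the first assignment (greedy).

    unclustered = [g.gene_id for g in genes if g.gene_id not in cluster_of]
    return list(clusters.values()), unclustered


# Toy data: A_g34 is collinear with B_g1 but sits 33 genes away from A_g1.
genes = [Gene("A_g1", "A", 1), Gene("B_g1", "B", 1), Gene("A_g34", "A", 34)]
pairs = [("A_g1", "B_g1"), ("B_g1", "A_g34")]
clusters, unclustered = greedy_clusters(genes, pairs)
```

With the default distance, the toy gene `A_g34` is left out of the cluster, mirroring the g34 example in the figure; raising `max_distance` above 33 would let it join.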
Fig. 2
Aligned genomic region in chr1 of Oryza nivara (top) and Oryza sativa Japonica group cv. Nipponbare (bottom) as displayed in the Ensembl Plants browser. Genes on the forward strand (>) are above contigs, whilst those on the negative strand (<) are underneath. As a result of the genomic alignment, genes of O. nivara overlap with gene models from O. sativa. This evidence can be used to identify collinear genes that take equivalent positions in different genomes, as illustrated with gene models ONIVA01G00130 and Os01g0100500, which overlap over 2.4 kb (yellow rectangle). The example shows that overlapping gene models might share only some exons. The table below shows the collinear gene models identified based on minimap2 and GSAlign alignments, together with the corresponding overlapped base pairs.
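The overlap evidence described above can be sketched in Python as follows. This is a simplified illustration that assumes both gene models have already been projected onto the same coordinate space through the genome alignment. The `GeneModel` class, the function names, the `min_overlap_bp` threshold, and the coordinates are hypothetical and are not taken from get_pangenes.pl or from the real rice annotations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneModel:
    gene_id: str
    chrom: str
    start: int   # 1-based, inclusive (GFF convention)
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def overlap_bp(a: GeneModel, b: GeneModel) -> int:
    """Number of base pairs shared by two gene models on the same sequence."""
    if a.chrom != b.chrom:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def is_collinear(a: GeneModel, b: GeneModel, min_overlap_bp: int = 1) -> bool:
    """Call two gene models collinear if they overlap on the same strand.

    min_overlap_bp is a hypothetical threshold chosen for this sketch.
    """
    return a.strand == b.strand and overlap_bp(a, b) >= min_overlap_bp


# Made-up coordinates, already expressed in the space of one genome.
gene_a = GeneModel("geneA", "chr1", 1000, 5000, "+")
gene_b = GeneModel("geneB", "chr1", 2601, 6000, "+")
gene_c = GeneModel("geneC", "chr1", 2601, 6000, "-")

shared = overlap_bp(gene_a, gene_b)
```

Here `gene_a` and `gene_b` share 2,400 bp on the forward strand and would be called collinear, whereas `gene_c` covers the same interval on the opposite strand and would not.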
Fig. 3
A Multiple alignment of protein sequences encoded in barley pangene cluster Horvu_MOREX_1H01G011400, produced with Clustalx. This cluster contains isoforms from 13 gene models, but none from genotype OUN333. The last sequence is encoded by a CDS sequence lifted-over from cultivar HOR3081 on the genome of OUN333, spanning 3 exons (exon boundaries are marked with asterisks). B Patch GFF file with the coordinates of the exons lifted-over from gene model Horvu_3081_1H01G015200. The underlying CDS nucleotide sequence was aligned with 411 matches, no indels and no mismatches with check_evidence.pl -f.
Fig. 4
Genomic context of pangene cluster HORVU.MOREX.r3.3HG0311160 (green arrows), which corresponds to barley locus HvOS2. The genome fragment on top corresponds to reference genome MorexV3 and the tracks below show collinear genes found in other barley assemblies and annotation sets. In this example, the BarkeBaRT2v18 gene is split into two partial models. Note that white gene models might not be collinear as they could be encoded in a different genome fragment. Figure generated with script check_evidence.pl and pyGenomeViz (https://github.com/moshi4/pyGenomeViz).
