Bioinformatics. 2022 Jun 27;38(13):3319-3326.
doi: 10.1093/bioinformatics/btac308.

ODGI: understanding pangenome graphs

Andrea Guarracino et al. Bioinformatics.

Abstract

Motivation: Pangenome graphs provide a complete representation of the mutual alignment of collections of genomes. These models offer the opportunity to study the entire genomic diversity of a population, including structurally complex regions. Nevertheless, analyzing hundreds of gigabase-scale genomes using pangenome graphs is difficult, as it is not well supported by existing tools. Hence, fast and versatile software is required to ask advanced questions of such data efficiently.

Results: We wrote Optimized Dynamic Genome/Graph Implementation (ODGI), a novel suite of tools that implements scalable algorithms and has an efficient in-memory representation of DNA pangenome graphs in the form of variation graphs. ODGI supports pre-built graphs in the Graphical Fragment Assembly format. ODGI includes tools for detecting complex regions, extracting pangenomic loci, removing artifacts, exploratory analysis, manipulation, validation and visualization. Its fast parallel execution facilitates routine pangenomic tasks, as well as pipelines that can quickly answer complex biological questions of gigabase-scale pangenome graphs.
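To make the data model concrete, the following minimal Python sketch reads a toy graph in Graphical Fragment Assembly (GFA v1) text into a simple variation graph of nodes (S lines), edges (L lines) and paths (P lines), then spells out the sequence of each path. It is an illustration only and does not reflect ODGI's internal representation or API; the names `VariationGraph`, `parse_gfa`, `path_sequence` and the toy graph itself are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class VariationGraph:
    """Toy variation graph (hypothetical structure, not ODGI's own)."""
    segments: dict = field(default_factory=dict)  # node id -> DNA sequence
    links: list = field(default_factory=list)     # (from, from_orient, to, to_orient)
    paths: dict = field(default_factory=dict)     # path name -> [(node id, orient), ...]


def parse_gfa(text):
    """Parse the S, L and P records of a GFA v1 string; other records are ignored."""
    graph = VariationGraph()
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        record = fields[0]
        if record == "S":    # S <name> <sequence>
            graph.segments[fields[1]] = fields[2]
        elif record == "L":  # L <from> <orient> <to> <orient> <overlap>
            graph.links.append((fields[1], fields[2], fields[3], fields[4]))
        elif record == "P":  # P <name> <node+/-,node+/-,...> <overlaps>
            steps = [(step[:-1], step[-1]) for step in fields[2].split(",")]
            graph.paths[fields[1]] = steps
    return graph


_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def path_sequence(graph, name):
    """Concatenate node sequences along a path, honoring step orientation."""
    parts = []
    for node, orient in graph.paths[name]:
        seq = graph.segments[node]
        parts.append(seq if orient == "+" else reverse_complement(seq))
    return "".join(parts)


# Toy graph: a single-nucleotide bubble (T versus G) between two shared nodes.
_ROWS = [
    ("H", "VN:Z:1.0"),
    ("S", "1", "ACG"),
    ("S", "2", "T"),
    ("S", "3", "G"),
    ("S", "4", "TTA"),
    ("L", "1", "+", "2", "+", "0M"),
    ("L", "1", "+", "3", "+", "0M"),
    ("L", "2", "+", "4", "+", "0M"),
    ("L", "3", "+", "4", "+", "0M"),
    ("P", "ref", "1+,2+,4+", "*"),
    ("P", "alt", "1+,3+,4+", "*"),
]
TOY_GFA = "\n".join("\t".join(row) for row in _ROWS)

if __name__ == "__main__":
    toy = parse_gfa(TOY_GFA)
    for path_name in toy.paths:
        print(path_name, path_sequence(toy, path_name))
```

Keeping paths as ordered lists of oriented node visits is what lets each genome be read back out of the shared graph, which is the property the path-based visualizations and analyses rely on.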

Availability and implementation: ODGI is published as free software under the MIT open source license. Source code can be downloaded from https://github.com/pangenome/odgi and documentation is available at https://odgi.readthedocs.io. ODGI can be installed via Bioconda (https://bioconda.github.io/recipes/odgi/README.html) or GNU Guix (https://github.com/pangenome/odgi/blob/master/guix.scm).

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
Overview of the methods provided by ODGI (in black) and their supported input (in blue) and output (in red) data formats (A color version of this figure appears in the online version of this article.)
Fig. 2.
Visualizing the major histocompatibility complex (MHC) and complement component 4 (C4) pangenome graphs. (a) odgi draw layout of the MHC pangenome graph extracted from a whole human pangenome graph of 90 haplotypes. The red rectangle highlights the C4 region. (b–e) odgi viz visualizations of the C4 pangenome graph, where eight paths are displayed: two reference genomes (CHM13 and GRCh38 on the top) and six haplotypes of three diploid individuals. (b) odgi viz default modality: the image shows a quite linear graph. The links at the bottom indicate the presence of a structural variant (long link) with another structural variant nested inside it (short link on the left). (c) Color by path position. The top two reference genomes and one haplotype (HG01952#2) go from left to right, while five haplotypes go in the opposite direction, as indicated by the black color on their left. (d) odgi viz color by strandness: the red paths indicate the haplotypes that were assembled in reverse with respect to the two reference genomes. (e) odgi viz color by node depth: using the Spectra color palette with four levels of node depth, white indicates no depth, while gray, red and yellow indicate depth 1, 2 and greater than or equal to 3, respectively. Coloring by node depth, we can see that the two references present two different allele copies of the C4 genes, both of them including the HERV sequence. The entirely gray paths have one copy of these genes. HG01071#2 presents three copies of the locus (orange), of which one contains the HERV sequence (gray in the middle of the orange). In HG01952#1, the HERV sequence is absent. (f) Layout of the C4 pangenome graph made with the Bandage tool (Wick et al., 2015) and annotated by using odgi position. Green nodes indicate the C4 genes (in red). The red rectangle highlights the regions where C4A and C4B genes differ. (g) Annotated Bandage layout of the C4 region where C4A and C4B genes differ due to single nucleotide variants leading to changes in the encoded protein sequences. Node labels were annotated by using odgi position. (h) Visualization of odgi untangle output in the C4 pangenome graph: the plots show the copy number status of the sequences in the C4 region with respect to the GRCh38 reference sequence, making clear, for example, that in HG00438#2, the C4A gene is missing (no black lines in the region annotated in red) (A color version of this figure appears in the online version of this article.)
Fig. 3.
Features of a 90-haplotype human pangenome graph of the exon 1 huntingtin gene (HTTexon1): (a) excerpt of vital statistics of the HTTexon1 graph displayed by MultiQC’s ODGI module. (b) Per nucleotide node degree distribution of CHM13 in the HTTexon1 graph. Around position 200 there is a huge variation in node degree. (c) Per nucleotide node depth distribution of CHM13 in the HTTexon1 graph. The alternating depth around position 200 indicates polymorphic variation, complementing the above node degree analysis. (d) odgi viz visualization of the 23 largest gene alleles, CHM13 and GRCh38 of the HTTexon1 graph. (e) vg viz nucleotide-level visualization of 10 gene alleles, CHM13 and GRCh38 of the HTTexon1 graph, focusing on the CAG variable repeat region.
Fig. 4.
Performance on a graph of human chromosome 6 from the HPRC. ODGI compares favorably to VG across all routine pangenomic tasks. Evaluations across threads were done using a 64 human haplotype graph. Evaluations across haplotypes were done using 16 threads. (a) Performance evaluation when translating a graph into the tools’ respective native formats. (b) Performance evaluation when extracting the centromeric region from the HPRC graph. (c) Performance evaluation when visualizing a graph. Both tools were run with only one thread. vg viz: *An 816 MB SVG was produced which cannot be opened by any program. **All produced SVGs only contain an XML header, nothing else.
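
The node depth and node degree statistics referred to in the captions of Figs 2 and 3 can be illustrated with a short Python sketch. It assumes the usual definitions: node depth is the number of times a node is visited across all paths, and node degree is the number of edges attached to a node. The function names and the toy graph are hypothetical, and the code is not taken from ODGI.

```python
from collections import Counter


def node_depth(paths):
    """Count how many times each node is visited across all paths."""
    depth = Counter()
    for steps in paths.values():
        depth.update(steps)
    return depth


def node_degree(edges):
    """Count the distinct edges attached to each node."""
    degree = Counter()
    for source, target in set(edges):
        degree[source] += 1
        degree[target] += 1
    return degree


def per_base_profile(path, node_lengths, values):
    """Expand a per-node value into one entry per nucleotide along a path."""
    profile = []
    for node in path:
        profile.extend([values[node]] * node_lengths[node])
    return profile


# Toy graph: nodes 2 and 3 form a bubble; the "del" path skips the bubble.
TOY_PATHS = {
    "ref": [1, 2, 4],
    "alt": [1, 3, 4],
    "del": [1, 4],
}
TOY_EDGES = [(1, 2), (1, 3), (1, 4), (2, 4), (3, 4)]
TOY_LENGTHS = {1: 3, 2: 1, 3: 1, 4: 3}  # node id -> sequence length in bases

if __name__ == "__main__":
    depth = node_depth(TOY_PATHS)
    degree = node_degree(TOY_EDGES)
    print("depth along ref: ", per_base_profile(TOY_PATHS["ref"], TOY_LENGTHS, depth))
    print("degree along ref:", per_base_profile(TOY_PATHS["ref"], TOY_LENGTHS, degree))
```

In this toy case the depth along the reference path drops inside the bubble, where the paths diverge, and the degree rises at the nodes flanking it; both signals mark the variable site in different ways.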
