BMC Bioinformatics. 2019 Oct 25;20(1):519.
doi: 10.1186/s12859-019-3115-8.

GenGraph: a python module for the simple generation and manipulation of genome graphs

Jon Mitchell Ambler et al. BMC Bioinformatics.

Abstract

Background: As sequencing technology improves, the concept of a single reference genome is becoming increasingly restrictive. In the case of Mycobacterium tuberculosis, one must often choose between using a genome that is closely related to the isolate and one that is annotated in detail. One promising solution to this problem is the graph-based representation of collections of genomes as a single genome graph. Though a handful of tools can already create genome graphs and have demonstrated the advantages of this new paradigm, there is still a need for flexible tools that researchers can use to overcome challenges in genomics studies.

Results: We present GenGraph, a Python toolkit and accompanying modules that use existing multiple sequence alignment tools to create genome graphs. Python is one of the most popular coding languages for the biological sciences, and by providing these tools, GenGraph makes it easier to experiment with and develop new tools that utilise genome graphs. The conceptual model used is highly intuitive, and as far as possible the graph structure represents the biological relationship between the genomes. This design means that users will quickly be able to start creating genome graphs and using them in their own projects. We outline the methods used in the generation of the graphs, and give some examples of how the created graphs may be used. GenGraph utilises existing file formats and methods in the generation of these graphs, allowing graphs to be visualised and imported with widely used applications, including Cytoscape, R, and JavaScript.

Conclusions: GenGraph provides a set of tools for generating graph-based representations of sets of sequences with a simple conceptual model, written in the widely used coding language Python, and publicly available on GitHub.

Keywords: Genome; Graph; Module; Python; Toolkit.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Representation of repeats in the genome graph. a, Two sequences where sequence 2 contains 3 additional “ATG” repeats, highlighted in blue. b, GenGraph represents only differences, with node 1 representing both sequences, node 2 representing the additional repeats found only in sequence 2, and node 3 the sequence that is once again shared. c, This contrasts with an approach where the “ATG” repeat is represented as a single node with a self-loop. That approach may be neater and result in better compression, but it raises many practical problems, including not allowing the node to be labelled with the sequence start and stop positions
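
The node layout described in this caption can be sketched in plain Python. This is only an illustration under stated assumptions, not GenGraph's own data model or API: the dictionary keys ("sequence", "positions"), the node and sequence names, and the reconstruct helper are all hypothetical, and the example sequences are made up.

```python
# Hypothetical node layout (not GenGraph's API): each node stores its
# sequence plus 1-based (start, stop) positions for every input sequence
# that passes through it.
nodes = {
    "node_1": {"sequence": "GGCATG", "positions": {"seq_1": (1, 6), "seq_2": (1, 6)}},
    # The three extra "ATG" repeats are carried by one sequence only.
    "node_2": {"sequence": "ATGATGATG", "positions": {"seq_2": (7, 15)}},
    "node_3": {"sequence": "TTCA", "positions": {"seq_1": (7, 10), "seq_2": (16, 19)}},
}

# seq_2 walks node_1 -> node_2 -> node_3; seq_1 skips node_2.
edges = [("node_1", "node_2"), ("node_2", "node_3"), ("node_1", "node_3")]


def reconstruct(nodes, seq_id):
    """Rebuild one input sequence by ordering its nodes by start position."""
    own = [node for node in nodes.values() if seq_id in node["positions"]]
    own.sort(key=lambda node: node["positions"][seq_id][0])
    return "".join(node["sequence"] for node in own)


print(reconstruct(nodes, "seq_1"))
print(reconstruct(nodes, "seq_2"))
```

Because every node carries explicit start and stop positions, each input sequence can be recovered by sorting its nodes, which is the labelling that a self-loop representation would not allow.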
Fig. 2
Representation of inversions in the genome graph. During the first step of genome graph creation, co-linear blocks are identified. In some cases, these may be homologous sequences that have been inverted. GenGraph represents these sequences in a single node (that may be broken down into more nodes in the second step) and represents the inverted state of the sequence by negative nucleotide position values in the node. a, Two sequences are shown where an inversion has taken place. In practice this is normally a larger stretch of sequence, perhaps a few kb in length. b, The positions of the sequences are different, as is generally the case with homologous sequences. The positions of the nucleotides flanking the breakpoints are shown. c, The inversion in the second sequence is represented by reversed negative nucleotide position values. d, This way, both sequences are represented in the same node, and to recreate sequence 2, the sequence in the node is simply reverse-complemented
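
A minimal sketch of the idea in this caption follows, assuming a hypothetical node dictionary in which negative positions flag an inverted sequence. The key names, the coordinates, and the sequence_for helper are illustrative assumptions rather than GenGraph's actual interface; only the sign of the positions matters here.

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def sequence_for(node, seq_id):
    """Return the node sequence as it appears in the given input sequence.

    Assumption for this sketch: negative start and stop values mean the
    sequence is inverted relative to the orientation stored in the node.
    """
    start, stop = node["positions"][seq_id]
    if start < 0 and stop < 0:
        return reverse_complement(node["sequence"])
    return node["sequence"]


# Hypothetical node shared by two sequences; seq_2 carries the inversion.
node = {
    "sequence": "AAGCTC",
    "positions": {"seq_1": (101, 106), "seq_2": (-206, -201)},
}

print(sequence_for(node, "seq_1"))
print(sequence_for(node, "seq_2"))
```

Storing the orientation in the sign of the positions means no second copy of the sequence is needed for the inverted genome.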
Fig. 3
Overview of the GenGraph algorithm. a, Co-linear blocks of sequence are identified to determine the structural relationship of the sequences. b-c, Each block is then realigned using an MSA tool. c-d, Identical sequences are reduced into nodes and edges are created
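
The final step, reducing identical aligned sequence into nodes, can be illustrated with a simplified sketch. It is not GenGraph's implementation: it assumes a block has already been aligned into equal-length strings with "-" for gaps, it merges consecutive alignment columns in which the same groups of sequences agree, and it leaves out edge creation. All function and key names are hypothetical.

```python
from itertools import groupby


def column_signature(column, seq_ids):
    """Describe which sequences share a base in one alignment column."""
    groups = {}
    for seq_id, base in zip(seq_ids, column):
        if base != "-":  # gapped sequences do not pass through this column
            groups.setdefault(base, []).append(seq_id)
    return tuple(sorted(tuple(ids) for ids in groups.values()))


def collapse_alignment(aligned):
    """Merge runs of columns with the same sharing pattern into nodes."""
    seq_ids = list(aligned)
    columns = list(zip(*aligned.values()))
    nodes = []
    runs = groupby(columns, key=lambda col: column_signature(col, seq_ids))
    for signature, run in runs:
        run = list(run)
        for ids in signature:
            row = seq_ids.index(ids[0])  # any member carries the same bases
            nodes.append({
                "sequence": "".join(col[row] for col in run),
                "ids": list(ids),
            })
    return nodes


# Made-up aligned block: one substitution and one single-base insertion.
aligned = {"seq_1": "ACGT-A", "seq_2": "ACCTGA"}
for node in collapse_alignment(aligned):
    print(node)
```

Shared stretches end up in a single node listing both sequences, while the substitution and the insertion produce nodes that belong to one sequence only.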
Fig. 4
Plot of exported subgraph. a, Cytoscape allows for the styling of imported networks, and by mapping the node width to the sequence length it is simple to visualise which nodes represent insertions. Nodes can be coloured by which isolates they contain. In this case, Beijing isolates are represented by red nodes, H37Rv by blue nodes, and nodes shared by all isolates by purple nodes. b, For more detail on nodes of interest, a table listing the node and edge attributes is also available
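
One way to prepare a small subgraph for this kind of styling is sketched below, using only the Python standard library. It writes a SIF edge list and a tab-delimited node table, both of which Cytoscape can import. This is an assumption-laden illustration, not the export path GenGraph itself uses; the function names, column names, node names, and the "precedes" interaction label are hypothetical.

```python
import csv
import io


def write_sif(edges, handle, interaction="precedes"):
    """Write edges as tab-separated 'source interaction target' lines."""
    for source, target in edges:
        handle.write(f"{source}\t{interaction}\t{target}\n")


def write_node_table(nodes, handle):
    """Write one row per node with attributes usable for visual mapping."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["name", "sequence_length", "isolates"])
    for name, attrs in nodes.items():
        isolates = ",".join(attrs["isolates"])
        writer.writerow([name, len(attrs["sequence"]), isolates])


# Made-up subgraph: node_2 is an insertion present in one isolate only.
nodes = {
    "node_1": {"sequence": "GGCATG", "isolates": ["H37Rv", "Beijing"]},
    "node_2": {"sequence": "ATGATGATG", "isolates": ["Beijing"]},
    "node_3": {"sequence": "TTCA", "isolates": ["H37Rv", "Beijing"]},
}
edges = [("node_1", "node_2"), ("node_2", "node_3"), ("node_1", "node_3")]

sif_buffer = io.StringIO()
table_buffer = io.StringIO()
write_sif(edges, sif_buffer)
write_node_table(nodes, table_buffer)
print(sif_buffer.getvalue())
print(table_buffer.getvalue())
```

With the node table loaded, the sequence_length column could drive node width and the isolates column could drive node colour, as described in panel a.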
