F1000Res. 2019 Aug 23:8:1490.
doi: 10.12688/f1000research.20233.1. eCollection 2019.

A Sequence Distance Graph framework for genome assembly and analysis

Luis Yanes et al. F1000Res.

Abstract

The Sequence Distance Graph (SDG) framework works with genome assembly graphs and raw data from paired, linked and long reads. It includes a simple de Bruijn graph module, and can import graphs using the graphical fragment assembly (GFA) format. It also maps raw reads onto graphs, and provides a Python application programming interface (API) to navigate the graph, access the mapped and raw data, and perform interactive or scripted analyses. Its complete workspace can be dumped to and loaded from disk, decoupling mapping from analysis and supporting multi-stage pipelines. We present the design and implementation of the framework, and two example analyses: scaffolding a short read graph with long reads, and navigating paths in a heterozygous graph for a simulated parent-offspring trio dataset. SDG is freely available under the MIT license at https://github.com/bioinfologics/sdg.

Keywords: Genome graph; genome assembly.

Conflict of interest statement

No competing interests were disclosed.

Figures

Figure 1. A simple Sequence Distance Graph with 5 nodes, including links with d<0, representing overlaps, and a link representing a gap of 10 bp.
Sequences appear in only one direction, and their reverse complement can be obtained by traversing the node in the opposite direction, from - to +. The two largest possible paths are [1, 2, 4, 5] and [1, -3, 4, 5], and their reverse complements are [-5, -4, -2, -1] and [-5, -4, 3, -1] respectively.
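
The signed-node convention in the Figure 1 caption can be illustrated with a minimal Python sketch. This is not SDG's API: the function names, the toy sequences, the overlap sizes, and the placement of the 10 bp gap on the link between nodes 4 and 5 are all assumptions made for illustration. Only the two paths and their reverse complements come from the caption.

```python
from typing import Dict, List, Tuple

# Complement table; N is kept as N so gap padding survives reversal.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

# Hypothetical toy sequences, stored once in the forward (+) direction.
SEQUENCES: Dict[int, str] = {
    1: "ACGTAC", 2: "ACGGA", 3: "TTGT", 4: "ATTC", 5: "GGCC",
}

# Hypothetical link distances between oriented nodes:
# d < 0 is an overlap of |d| bases, d > 0 is a gap of d bases.
DISTANCES: Dict[Tuple[int, int], int] = {
    (1, 2): -2, (1, -3): -2, (2, 4): -1, (-3, 4): -1, (4, 5): 10,
}


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def node_sequence(node: int) -> str:
    """Positive id: stored sequence. Negative id: its reverse complement."""
    seq = SEQUENCES[abs(node)]
    return seq if node > 0 else reverse_complement(seq)


def reverse_path(path: List[int]) -> List[int]:
    """Reverse complement of a path: reverse the order and flip every sign."""
    return [-node for node in reversed(path)]


def link_distance(a: int, b: int) -> int:
    """A link a -> b is the same link as -b -> -a seen from the other strand."""
    if (a, b) in DISTANCES:
        return DISTANCES[(a, b)]
    return DISTANCES[(-b, -a)]


def path_sequence(path: List[int]) -> str:
    """Spell a path, trimming overlaps and padding gaps with N."""
    out = node_sequence(path[0])
    for prev, cur in zip(path, path[1:]):
        d = link_distance(prev, cur)
        nxt = node_sequence(cur)
        out += nxt[-d:] if d < 0 else "N" * d + nxt
    return out


forward = [1, -3, 4, 5]
print(reverse_path(forward))
print(path_sequence(forward))
print(path_sequence(reverse_path(forward)))
```

Under these assumptions, spelling the reversed path gives the reverse complement of the forward path's sequence, which is why the graph only needs to store each node in one direction.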
Figure 2. Structure of a WorkSpace and access via an interactive Python session.
The WorkSpace holds the information for a project and contains the graphs, the mappers and k-mer counts. From Python, a previously saved WorkSpace is loaded from disk (1). The NodeView object is centred on a specific node and can be used to access node characteristics (i.e. size and sequence) and graph topology from the perspective of that node (i.e. neighbours in both directions (2)), and to retrieve information projected onto the selected node (i.e. mappings (3) and k-mer coverage (4)). Operations such as adding a KmerCounter to the WorkSpace and adding a count (5) can be performed, and the WorkSpace can be saved back to disk (6). Once loaded, the bulk of the WorkSpace is held in memory for fast access, while the raw read data from the DataStores remains on disk and is reached through random access.
Figure 3. Linkage at different stages of the long read scaffolding example, visualised using Bandage: A) SequenceDistanceGraph generated by sdg-dbg from short reads, B) DistanceGraph generated after using make_nextselected_linkage on the long read data, linking all nodes of 1100bp and more, C) DistanceGraph after eliminating all nodes with multiple connections (repeats).
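
The repeat-removal step in panel C can be pictured with a small Python sketch. It is a simplified illustration, not SDG's implementation: the function name `remove_multiply_connected`, the `(source, destination, distance)` tuple layout, and the toy links are hypothetical. The sketch assumes that a link a -> b leaves the end of oriented node a and enters the start of b, which is the same node end that -b would leave from.

```python
from collections import Counter
from typing import List, Set, Tuple

Link = Tuple[int, int, int]  # (source, destination, distance), signed node ids


def remove_multiply_connected(links: List[Link]) -> Tuple[List[Link], Set[int]]:
    """Drop every node that has more than one link on either of its ends."""
    end_degree: Counter = Counter()
    for src, dst, _dist in links:
        end_degree[src] += 1    # the link leaves the end of src
        end_degree[-dst] += 1   # entering dst uses the end that -dst leaves from
    # A node is treated as a repeat if either end has several connections.
    repeats = {abs(end) for end, degree in end_degree.items() if degree > 1}
    kept = [
        link for link in links
        if abs(link[0]) not in repeats and abs(link[1]) not in repeats
    ]
    return kept, repeats


# Hypothetical linkage: node 3 is entered from both node 2 and node 4.
toy_links = [(1, 2, 50), (2, 3, -10), (4, 3, 20), (3, 5, 0), (5, 6, 100)]
kept_links, repeat_nodes = remove_multiply_connected(toy_links)
print(repeat_nodes)
print(kept_links)
```

Removing ambiguous nodes this way leaves only linear chains of links, which is the property that makes the remaining linkage usable for scaffolding.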
Figure 4. Trio analysis: k-mer coverage for each side of the largest bubble structure in the child’s assembly by each of the three read sets.
Coverage drops to 0 on the opposite parent for k-mers that are unique to a parent.
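
A toy Python sketch shows why coverage drops to 0 in this situation. It is an illustration under stated assumptions rather than SDG's KmerCounter: the sequences, the read sets, and k=3 are invented, and k-mers are counted on the forward strand only instead of in canonical form.

```python
from collections import Counter
from typing import Iterable, List


def count_kmers(reads: Iterable[str], k: int) -> Counter:
    """Count every k-mer occurrence across a set of reads (forward strand only)."""
    counts: Counter = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def coverage_profile(sequence: str, counts: Counter, k: int) -> List[int]:
    """Look up the count of each successive k-mer of a sequence."""
    return [counts[sequence[i:i + k]] for i in range(len(sequence) - k + 1)]


K = 3
# Hypothetical bubble: the two sides differ by a single base.
side_a = "ACGTTGCA"  # haplotype assumed to come from the mother
side_b = "ACGATGCA"  # haplotype assumed to come from the father

mother = count_kmers([side_a], K)
father = count_kmers([side_b], K)
child = count_kmers([side_a, side_b], K)

for name, counts in (("child", child), ("mother", mother), ("father", father)):
    print(name, coverage_profile(side_a, counts, K))
```

In this toy case the k-mers spanning the variant on side A are absent from the father's reads, so his profile is zero there, while k-mers shared by both sides stay covered in all three read sets.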
