Front Neural Circuits. 2019 Feb 5;13:5.
doi: 10.3389/fncir.2019.00005. eCollection 2019.

DVID: Distributed Versioned Image-Oriented Dataservice


William T Katz et al. Front Neural Circuits.

Abstract

Open-source software development has skyrocketed in part due to community tools like github.com, which allow publication of code as well as the ability to create branches and push accepted modifications back to the original repository. As the number and size of EM-based datasets increase, the connectomics community faces similar issues when publishing snapshot data corresponding to a publication. Ideally, there would be a mechanism by which remote collaborators could modify branches of the data and then flexibly reintegrate results via moderated acceptance of changes. The DVID system provides a web-based connectomics API and the first steps toward such a distributed versioning approach to EM-based connectomics datasets. Through its use as the central data resource for Janelia's FlyEM team, we have integrated the concepts of distributed versioning into reconstruction workflows, supporting proofreader training and segmentation experiments through branched, versioned data. DVID also supports persistence to a variety of storage systems, from high-speed local SSDs to cloud-based object stores, which allows its deployment on laptops as well as large servers. Tailoring the backend storage to each type of connectomics data leads to efficient storage and fast queries. DVID is freely available as open-source software with an increasing number of supported storage options.

Keywords: EM reconstruction; big data; collaboration; connectomics; dataservice; datastore; distributed version control; versioning.


Figures

Figure 1
Key-value stores are among the simplest databases with few operations. Because of their simplicity, many storage systems can be mapped to key-value interfaces, including file systems where the file path is the key and the value is the file data.
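The file-system mapping described in this caption can be sketched as follows. This is a minimal illustration, not DVID's storage code: the class name `FileKVStore` and its `put`/`get`/`delete` methods are hypothetical, and the only idea taken from the caption is that the file path acts as the key and the file contents as the value.

```python
import os
import tempfile
from typing import Optional


class FileKVStore:
    """Hypothetical key-value store: key = relative file path, value = file bytes."""

    def __init__(self, root: str) -> None:
        self.root = os.path.abspath(root)

    def _path(self, key: str) -> str:
        # Resolve the key under the root and refuse keys that escape it.
        path = os.path.abspath(os.path.join(self.root, key))
        if not path.startswith(self.root + os.sep):
            raise ValueError(f"key escapes store root: {key!r}")
        return path

    def put(self, key: str, value: bytes) -> None:
        path = self._path(key)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(value)

    def get(self, key: str) -> Optional[bytes]:
        path = self._path(key)
        if not os.path.isfile(path):
            return None
        with open(path, "rb") as f:
            return f.read()

    def delete(self, key: str) -> None:
        path = self._path(key)
        if os.path.isfile(path):
            os.remove(path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = FileKVStore(tmp)
        store.put("grayscale/block_0_0_0", b"\x01\x02\x03")
        print(store.get("grayscale/block_0_0_0"))
        store.delete("grayscale/block_0_0_0")
        print(store.get("grayscale/block_0_0_0"))
```

Only three operations are needed, which is why many different backends can sit behind the same small interface.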
Figure 2
High-level view of DVID. Data types within DVID provide a Science API to clients while either transforming data to meet a primarily key-value Storage API or proxying data to a connectomics service.
Figure 3
Versioning can help train proofreaders without requiring any changes to proofreading tools. After full proofreading (version 8d65f), an interesting neuron is selected and its precursor at the root version c78a0 is assigned for training. Each trainee gets her own branch off the root version, and the reconstructed neuron (e.g., the one depicted in training version a6341) can be compared to version 8d65f.
Figure 4
The version DAG of the mushroom body reconstruction as seen through the DVID Console's DAG viewer. Snapshots show (A) a zoomed-out view showing the extent of the DAG, with significant proofreader training branches near the root, and (B) an enlarged view of the leaf at bottom left. Green nodes highlight the “master” branch, while the yellow leaf node is the current production version.
Figure 5
Each data type persists data using datatype-specific key-value pairs. Key-value pairs for two data instances are shown: a labelmap instance (data id 1) in blue and an annotation instance (data id 2) in red. The datatype-specific component of a key (TKey) could be a block coordinate for a block of voxels. DVID then wraps this TKey, prepending a short data instance identifier and appending a version identifier. A tombstone flag (T) can mark a key-value as deleted in a version without actually deleting earlier versions, as shown for the last key, which marks the deletion of annotations in block coordinate (23, 23, 10) in version 1.
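The key layout in this caption can be sketched as below. The field order (instance identifier, then TKey, then version identifier, then tombstone flag) follows the caption; the byte widths, the big-endian encoding, and the helper names `block_tkey`, `make_key`, and `parse_key` are assumptions made for illustration and are not taken from DVID's source.

```python
import struct
from typing import NamedTuple


class ParsedKey(NamedTuple):
    instance_id: int
    tkey: bytes
    version_id: int
    tombstone: bool


def block_tkey(x: int, y: int, z: int) -> bytes:
    """Hypothetical TKey for a block of voxels: three big-endian int32 coordinates."""
    return struct.pack(">iii", x, y, z)


def make_key(instance_id: int, tkey: bytes, version_id: int,
             tombstone: bool = False) -> bytes:
    """Wrap a TKey: prepend the data instance id, append version id and tombstone flag.

    Assumed widths: 4-byte instance id, 4-byte version id, 1-byte flag.
    """
    prefix = struct.pack(">I", instance_id)
    suffix = struct.pack(">IB", version_id, 1 if tombstone else 0)
    return prefix + tkey + suffix


def parse_key(key: bytes) -> ParsedKey:
    """Split a wrapped key back into its four components."""
    (instance_id,) = struct.unpack(">I", key[:4])
    version_id, flag = struct.unpack(">IB", key[-5:])
    return ParsedKey(instance_id, key[4:-5], version_id, bool(flag))


if __name__ == "__main__":
    # Deletion marker for annotations (data id 2) in block (23, 23, 10) at version 1,
    # mirroring the last key described in the caption.
    key = make_key(2, block_tkey(23, 23, 10), version_id=1, tombstone=True)
    print(parse_key(key))
```

Putting the instance identifier first keeps all keys of one data instance contiguous in a sorted store, and placing the version after the TKey keeps every version of the same block adjacent.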
Figure 6
Simple example of distribution of key-value pairs across the nodes of a DAG (only keys shown). In this example, segmentation and synapse data for a 6,400³ voxel volume with 1,000 labels is stored in labelmap (blue) and annotation (red) instances at the root version 8fc4. The majority of key-value pairs are ingested at the root and only modified key-value pairs need to be stored for later versions. Several mutation requests are shown with their modified key-value pairs.
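One way to read this caption is that a lookup at a given version falls back to the nearest ancestor that stored the key. The sketch below illustrates that idea under simplifying assumptions: each version has a single parent (branches but no merges), everything lives in memory, and the class `VersionedStore` and the child version name `a1b2` are hypothetical. It is not DVID's actual lookup code.

```python
from typing import Dict, Optional

# Sentinel marking a key as deleted in a version (a tombstone).
TOMBSTONE = object()


class VersionedStore:
    """Hypothetical store keeping only the key-value pairs modified in each version."""

    def __init__(self) -> None:
        self.parent: Dict[str, Optional[str]] = {}
        self.data: Dict[str, Dict[str, object]] = {}

    def add_version(self, version: str, parent: Optional[str] = None) -> None:
        if parent is not None and parent not in self.parent:
            raise KeyError(f"unknown parent version: {parent}")
        self.parent[version] = parent
        self.data[version] = {}

    def put(self, version: str, key: str, value: bytes) -> None:
        self.data[version][key] = value

    def delete(self, version: str, key: str) -> None:
        # Record a tombstone instead of touching earlier versions.
        self.data[version][key] = TOMBSTONE

    def get(self, version: str, key: str) -> Optional[bytes]:
        # Walk from the requested version toward the root until the key is found.
        node: Optional[str] = version
        while node is not None:
            stored = self.data[node]
            if key in stored:
                value = stored[key]
                return None if value is TOMBSTONE else value
            node = self.parent[node]
        return None


if __name__ == "__main__":
    store = VersionedStore()
    store.add_version("8fc4")                 # root version named in the caption
    store.put("8fc4", "block(0,0,0)", b"root voxels")
    store.put("8fc4", "block(1,0,0)", b"more voxels")

    store.add_version("a1b2", parent="8fc4")  # hypothetical child version
    store.put("a1b2", "block(0,0,0)", b"edited voxels")
    store.delete("a1b2", "block(1,0,0)")

    print(store.get("8fc4", "block(0,0,0)"))  # unchanged at the root
    print(store.get("a1b2", "block(0,0,0)"))  # overridden in the child
    print(store.get("a1b2", "block(1,0,0)"))  # deleted in the child
```

The child version stores two entries rather than a full copy of the volume, which is the storage saving the caption describes.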
Figure 7
Scalability of uncompressed grayscale image reads from Google Cloud Store backend. As the number of DVID servers simultaneously requesting non-overlapping image subvolumes from a 16 TeraVoxel dataset increases, the throughput plateaus just below 1.2 Gigavoxels per second, or 9.6 Gigabits per second at 8 bits per voxel. Servers were at the Janelia cluster with 16 real request threads per server, connecting to a Northern Virginia Google Cloud Store through a 10 Gigabits per second connection. The grayscale instance had only one version corresponding to the ingested image (8-bit/voxel) volume.
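The two throughput figures in this caption are related by the voxel size, and the conversion also shows how close the plateau sits to the link capacity. The short calculation below uses only numbers stated in the caption; the function name `voxel_rate_to_bit_rate` is hypothetical.

```python
def voxel_rate_to_bit_rate(voxels_per_second: int, bits_per_voxel: int) -> int:
    """Convert a voxel throughput into a bit throughput for uncompressed data."""
    return voxels_per_second * bits_per_voxel


if __name__ == "__main__":
    plateau_voxels = 1_200_000_000      # just below 1.2 Gigavoxels per second
    bits_per_voxel = 8                  # 8-bit grayscale
    link_capacity = 10_000_000_000      # 10 Gigabits per second connection

    plateau_bits = voxel_rate_to_bit_rate(plateau_voxels, bits_per_voxel)
    print(plateau_bits / 1e9, "Gbit/s")
    print(plateau_bits / link_capacity, "of link capacity")
```

At 8 bits per voxel the plateau works out to about 96% of the 10 Gigabits per second connection, so the plateau coincides with a nearly saturated link.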
Figure 8
Typical EM reconstructions produce a version DAG with most changes toward the root and fewer, human-guided changes toward the leaf nodes. This means that the bulk of data will be committed and immutable.
Figure 9
As shown by software version control systems like git, distributed versioning is an effective workflow for sharing changes via pull requests. The figure depicts a future scenario where the root version at Janelia has been shared with remote collaborators. After changes at the remote site, a pull request is sent back.
