Front Neural Circuits. 2019 Feb 5:13:5.

doi: 10.3389/fncir.2019.00005. eCollection 2019.

DVID: Distributed Versioned Image-Oriented Dataservice

William T Katz¹, Stephen M Plaza¹

PMID: 30804760
PMCID: PMC6371063
DOI: 10.3389/fncir.2019.00005

William T Katz et al. Front Neural Circuits. 2019.

Affiliation

¹ Janelia Research Campus, Howard Hughes Medical Institute, Ashburn, VA, United States.

Abstract

Open-source software development has skyrocketed in part due to community tools like github.com, which allow publication of code as well as the creation of branches and the pushing of accepted modifications back to the original repository. As the number and size of EM-based datasets increase, the connectomics community faces similar issues when publishing snapshot data corresponding to a publication. Ideally, there would be a mechanism by which remote collaborators could modify branches of the data and then flexibly reintegrate results via moderated acceptance of changes. The DVID system provides a web-based connectomics API and the first steps toward such a distributed versioning approach to EM-based connectomics datasets. Through its use as the central data resource for Janelia's FlyEM team, we have integrated the concepts of distributed versioning into reconstruction workflows, allowing support for proofreader training and segmentation experiments through branched, versioned data. DVID also supports persistence to a variety of storage systems, from high-speed local SSDs to cloud-based object stores, which allows its deployment on laptops as well as large servers. Tailoring the backend storage to each type of connectomics data leads to efficient storage and fast queries. DVID is freely available as open-source software with an increasing number of supported storage options.

Keywords: EM reconstruction; big data; collaboration; connectomics; dataservice; datastore; distributed version control; versioning.

Figures

**Figure 1**
Key-value stores are among the simplest databases with few operations. Because of their simplicity, many storage systems can be mapped to key-value interfaces, including file systems where the file path is the key and the value is the file data.
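
The file-system mapping mentioned in this caption can be sketched in a few lines of Python. This is a minimal illustration only, not DVID's storage code: the class name `FileKV`, its method names, and the example key are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Optional


class FileKV:
    """Hypothetical key-value store over a directory.

    The key is a relative file path and the value is the file's bytes.
    """

    def __init__(self, root: Path) -> None:
        self.root = root

    def put(self, key: str, value: bytes) -> None:
        path = self.root / key
        # Create intermediate directories so keys may contain "/".
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(value)

    def get(self, key: str) -> Optional[bytes]:
        path = self.root / key
        return path.read_bytes() if path.is_file() else None

    def delete(self, key: str) -> None:
        # missing_ok requires Python 3.8 or later.
        (self.root / key).unlink(missing_ok=True)


with tempfile.TemporaryDirectory() as tmp:
    store = FileKV(Path(tmp))
    store.put("grayscale/0_0_0", b"\x00\x01")  # illustrative key only
    print(store.get("grayscale/0_0_0"))
```

Only three operations (put, get, delete) are needed, which is why so many storage systems can sit behind the same interface.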

**Figure 2**
High-level view of DVID. Data types within DVID provide a Science API to clients while transforming data to meet a primarily key-value Storage API or proxy data to a connectomics service.

**Figure 3**
Versioning can help train proofreaders without requiring any changes to proofreading tools. After full proofreading (version 8d65f), an interesting neuron is selected and its precursor at the root version c78a0 is assigned for training. Each trainee gets her own branch off the root version, and the reconstructed neuron (e.g., the one depicted in training version a6341) can be compared to version 8d65f.

**Figure 4**
The version DAG of the mushroom body reconstruction as seen through the *DVID Console*'s DAG viewer. Snapshots show **(A)** a zoomed-out view showing the extent of the DAG, with significant proofreader training branches near the root, and **(B)** a blown-up view of the leaf at bottom left. Green nodes highlight the “master” branch while the yellow leaf node is the current production version.

**Figure 5**
Each data type persists data using datatype-specific key-value pairs. Key-value pairs for two data instances are shown: a **labelmap** instance (data id 1) in blue and an **annotation** instance (data id 2) in red. The datatype-specific component of a key (**TKey**) could be a block coordinate for a block of voxels. DVID then wraps this **TKey**, prepending a short data instance identifier and appending a version identifier. A tombstone flag (T) can mark a key-value pair as deleted in a version without actually deleting earlier versions, as shown for the last key, which marks the deletion of annotations in block coordinate (23, 23, 10) in version 1.
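
The wrapping described in this caption can be sketched as follows. The byte widths, byte order, and function names (`block_tkey`, `wrap_key`, `unwrap_key`) are assumptions chosen for illustration; the caption does not specify DVID's actual on-disk key layout.

```python
import struct
from typing import Tuple


def block_tkey(x: int, y: int, z: int) -> bytes:
    """Hypothetical TKey: a block coordinate as three big-endian int32s."""
    return struct.pack(">iii", x, y, z)


def wrap_key(instance_id: int, tkey: bytes, version_id: int,
             tombstone: bool = False) -> bytes:
    """Prepend the data instance id, append the version id and tombstone flag.

    Assumed layout: 4-byte instance id, TKey, 4-byte version id, 1-byte flag.
    """
    prefix = struct.pack(">I", instance_id)
    suffix = struct.pack(">IB", version_id, int(tombstone))
    return prefix + tkey + suffix


def unwrap_key(key: bytes) -> Tuple[int, bytes, int, bool]:
    """Split a wrapped key back into its four components."""
    (instance_id,) = struct.unpack(">I", key[:4])
    version_id, flag = struct.unpack(">IB", key[-5:])
    return instance_id, key[4:-5], version_id, bool(flag)


# The last key in the figure: annotation instance (data id 2),
# block coordinate (23, 23, 10), deleted in version 1.
key = wrap_key(2, block_tkey(23, 23, 10), version_id=1, tombstone=True)
print(unwrap_key(key))
```

Placing the instance identifier first means that, in an ordered key-value store, all keys of one data instance are contiguous, and all versions of a given TKey sit next to each other.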

**Figure 6**
Simple example of distribution of key-value pairs across the nodes of a DAG (only keys shown). In this example, segmentation and synapse data for a 6,400³ voxel volume with 1,000 labels is stored in **labelmap** (blue) and **annotation** (red) instances at the root version *8fc4*. The majority of key-value pairs are ingested at the root and only modified key-value pairs need to be stored for later versions. Several mutation requests are shown with their modified key-value pairs.
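
One way to read a key at a given version when only modified pairs are stored is to walk from that version toward the root and return the first match. The sketch below illustrates that idea under simplifying assumptions: each version has a single parent (no merges), everything lives in memory, and the class `VersionedKV`, the child version id `a1b2`, and the key strings are all hypothetical. It is not DVID's implementation.

```python
from typing import Dict, Optional, Tuple

TOMBSTONE = object()  # sentinel marking a key as deleted in a version


class VersionedKV:
    """Hypothetical versioned store holding only per-version changes."""

    def __init__(self) -> None:
        self.parent: Dict[str, Optional[str]] = {}
        self.data: Dict[Tuple[str, str], object] = {}

    def add_version(self, version: str, parent: Optional[str] = None) -> None:
        self.parent[version] = parent

    def put(self, version: str, key: str, value: bytes) -> None:
        self.data[(key, version)] = value

    def delete(self, version: str, key: str) -> None:
        # Earlier versions keep their value; only this version hides it.
        self.data[(key, version)] = TOMBSTONE

    def get(self, version: str, key: str) -> Optional[bytes]:
        current: Optional[str] = version
        while current is not None:
            if (key, current) in self.data:
                value = self.data[(key, current)]
                return None if value is TOMBSTONE else value
            current = self.parent[current]  # fall back to the ancestor
        return None


store = VersionedKV()
store.add_version("8fc4")                 # root version from the figure
store.add_version("a1b2", parent="8fc4")  # hypothetical child version
store.put("8fc4", "block(0,0,0)", b"ingested")
store.put("8fc4", "block(1,0,0)", b"ingested")
store.put("a1b2", "block(1,0,0)", b"edited")  # only the changed pair is stored
print(store.get("a1b2", "block(0,0,0)"), store.get("a1b2", "block(1,0,0)"))
```

Unmodified keys cost nothing in later versions because reads fall through to the ancestor that last wrote them.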

**Figure 7**
Scalability of uncompressed grayscale image reads from Google Cloud Store backend. As the number of DVID servers simultaneously requesting non-overlapping image subvolumes from a 16 TeraVoxel dataset increases, the throughput plateaus just below 1.2 Gigavoxels per second, or 9.6 Gigabits per second. Servers were at the Janelia cluster with 16 real request threads per server, connecting to a Northern Virginia Google Cloud Store through a 10 Gigabits per second connection. The grayscale instance had only one version corresponding to the ingested image (8-bit/voxel) volume.
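
The relationship between the two throughput figures follows from the 8-bit voxels. The short calculation below reproduces it and compares the plateau with the 10 Gigabits per second link. It treats 1.2 Gigavoxels per second as the plateau value (the caption says "just below") and assumes decimal prefixes (giga = 10⁹, tera = 10¹²); the full-volume read time is a derived illustration, not a number reported in the caption.

```python
voxels_per_second = 1.2e9  # plateau throughput (upper bound from the caption)
bits_per_voxel = 8         # 8-bit grayscale
link_gbps = 10.0           # network connection to the cloud store
dataset_voxels = 16e12     # 16 TeraVoxel dataset, decimal prefix assumed

# Voxel rate converted to a bit rate in Gigabits per second.
throughput_gbps = voxels_per_second * bits_per_voxel / 1e9

# Fraction of the link consumed at the plateau.
link_utilization = throughput_gbps / link_gbps

# Time to read every voxel once at the plateau rate, in hours.
full_read_hours = dataset_voxels / voxels_per_second / 3600

print(f"{throughput_gbps:.1f} Gbit/s, "
      f"{link_utilization:.0%} of link, "
      f"{full_read_hours:.1f} h for one full read")
```

Under these assumptions the plateau sits close to the capacity of the network link.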

**Figure 8**
Typical EM reconstructions produce a version DAG with most changes toward the root and fewer, human-guided changes toward the leaf nodes. This means that the bulk of data will be committed and immutable.

**Figure 9**
As shown by software version control systems like git, distributed versioning is an effective workflow for sharing changes via pull requests. The figure depicts a future scenario where the root version at Janelia has been shared with remote collaborators. After changes at the remote site, a pull request is sent back.

Grants and funding

HHMI/Howard Hughes Medical Institute/United States
