This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

ArXiv [Preprint]. 2024 Aug 30:arXiv:2408.17320v1.

BioBricks.ai: A Versioned Data Registry for Life Sciences Data Assets

Yifan Gao¹, Zakariyya Mughal², Jose A Jaramillo-Villegas³,⁴, Marie Corradi⁵, Alexandre Borrel⁶, Ben Lieberman², Suliman Sharif², John Shaffer², Karamarie Fecho⁷,⁸, Ajay Chatrath⁹, Alexandra Maertens¹, Marc A T Teunis⁵, Nicole Kleinstreuer¹⁰, Thomas Hartung¹,¹¹, Thomas Luechtefeld¹,²

Affiliations

¹ Center For Alternative to Animal Testing, Johns Hopkins University, Baltimore, MD, USA.
² Insilica, Bethesda, MD, USA.
³ Laboratory for Research in Complex Systems, Menlo Park, California, USA.
⁴ Facultad de Ingenierías, Universidad Tecnológica de Pereira, Pereira, Colombia.
⁵ Innovative Testing in Life Sciences & Chemistry, University of Applied Sciences Utrecht, Utrecht, The Netherlands.
⁶ Inotiv, Research Triangle Park, North Carolina, USA.
⁷ Renaissance Computing Institute, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA.
⁸ Copperline Professional Solutions, LLC, Pittsboro, NC, USA.
⁹ Department of Neurological Surgery, Washington University in Saint Louis, Saint Louis, Missouri.
¹⁰ NTP Interagency Center for the Evaluation of Alternative Methods, Research Triangle Park, NC, USA.
¹¹ University of Konstanz, Germany.

PMID: 39253636
PMCID: PMC11383443

Update in

BioBricks.ai: a versioned data registry for life sciences data assets.
Gao Y, Mughal Z, Jaramillo-Villegas JA, Corradi M, Borrel A, Lieberman B, Sharif S, Shaffer J, Fecho K, Chatrath A, Maertens A, Teunis MAT, Kleinstreuer N, Hartung T, Luechtefeld T. Front Artif Intell. 2025 Aug 13;8:1599412. doi: 10.3389/frai.2025.1599412. eCollection 2025. PMID: 40880880. Free PMC article.

Abstract

Researchers in biomedical research, public health, and the life sciences often spend weeks or months discovering, accessing, curating, and integrating data from disparate sources, which significantly delays the start of actual analysis and innovation. Instead of leaving countless developers to build redundant and inconsistent data pipelines, BioBricks.ai offers a centralized data repository and a suite of developer-friendly tools that simplify access to scientific data. BioBricks.ai currently delivers over ninety biological and chemical datasets. It provides a package-manager-like system for installing data sources and managing dependencies on them. Each 'brick' is a Data Version Control (DVC) git repository that supports an updatable pipeline for extracting, transforming, and loading data into the BioBricks.ai backend at https://biobricks.ai. Use cases include accelerating data science workflows and facilitating the creation of novel data assets by integrating multiple datasets into unified, harmonized resources. In conclusion, BioBricks.ai offers an opportunity to accelerate access to and use of public data through a single open platform.
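The package-manager-like behavior described in the abstract can be illustrated with a minimal sketch. It does not use the BioBricks.ai tooling: the `Brick` class, the `install_order` function, the brick names, and the version strings are all hypothetical, chosen only to show how a registry of versioned bricks might resolve dependencies so that source bricks are installed before the bricks derived from them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Brick:
    """Hypothetical record for one versioned data asset."""
    name: str
    version: str              # e.g. a git commit hash pinning the data version
    depends_on: tuple = ()    # names of bricks this brick is built from


def install_order(registry, target):
    """Return brick names in install order, dependencies first (depth-first)."""
    order = []
    visiting = set()

    def visit(name):
        if name in order:
            return                      # already scheduled
        if name in visiting:
            raise ValueError(f"dependency cycle at {name!r}")
        if name not in registry:
            raise KeyError(f"unknown brick {name!r}")
        visiting.add(name)
        for dep in registry[name].depends_on:
            visit(dep)
        visiting.discard(name)
        order.append(name)

    visit(target)
    return order


# Hypothetical registry: one harmonized brick built from two source bricks.
registry = {
    "source_a": Brick("source_a", "abc123"),
    "source_b": Brick("source_b", "def456"),
    "harmonized": Brick("harmonized", "789aaa", ("source_a", "source_b")),
}

print(install_order(registry, "harmonized"))
# → ['source_a', 'source_b', 'harmonized']
```

Pinning each brick to a version identifier is what would let a derived asset be rebuilt reproducibly when an upstream source changes; how BioBricks.ai itself records and resolves versions is not specified in this record.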

Keywords: BioBricks.ai; Bioinformatics; Cheminformatics; Data Integration; Machine Learning; Public Health Data.

Conflict of interest statement

The authors declare the following potential conflicts of interest regarding the research and publication of this paper: BioBricks is a product developed by Insilica LLC, and many of the authors are employees of Insilica LLC. As such, there may be a perceived or real financial interest in the outcomes of the research and the development of BioBricks. The authors affirm that their contributions to the research and the manuscript were conducted with scientific integrity and without bias influenced by their association with Insilica LLC.

Figures

**Figure 1**
**Top left** - A code example to install, load, and analyze ToxRefDB data. **Bottom left** - The result of running the code example. **Right** - Tabular data in bar chart form.

**Figure 2**
**Left** - Truncated versions of the (1) .bb/dependencies.txt and (2) dvc.yaml files in the ChemHarmony BioBrick. **Center** - The 3-table schema of ChemHarmony, a simple chemical activities dataset with substances, properties, and activities tables. **Right** - How to count activities by source by installing the ChemHarmony brick and using it with Apache Spark, with the resulting table in the lower right.
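The kind of query described for Figure 2 can be sketched without the figure itself. The example below is an assumption-laden illustration, not the ChemHarmony schema: it uses Python's built-in `sqlite3` in place of Apache Spark, and every column name (`sid`, `pid`, `aid`, `source`, `value`) and every row is hypothetical. Only the three table names and the idea of counting activities by source come from the caption.

```python
import sqlite3

# In-memory database standing in for an installed brick (assumption).
conn = sqlite3.connect(":memory:")

# Three tables named as in the caption; all columns are hypothetical.
conn.executescript("""
CREATE TABLE substances (sid INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE properties (pid INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE activities (
    aid    INTEGER PRIMARY KEY,
    sid    INTEGER REFERENCES substances(sid),
    pid    INTEGER REFERENCES properties(pid),
    source TEXT,
    value  TEXT
);
""")

# Placeholder rows, invented purely so the query has something to count.
conn.executemany("INSERT INTO substances VALUES (?, ?)",
                 [(1, "substance_1"), (2, "substance_2")])
conn.executemany("INSERT INTO properties VALUES (?, ?)",
                 [(1, "property_1")])
conn.executemany("INSERT INTO activities VALUES (?, ?, ?, ?, ?)", [
    (1, 1, 1, "source_a", "positive"),
    (2, 2, 1, "source_a", "negative"),
    (3, 1, 1, "source_b", "positive"),
])

# Count activities per source, ordered by source name.
counts = conn.execute(
    "SELECT source, COUNT(*) FROM activities GROUP BY source ORDER BY source"
).fetchall()

print(counts)
# → [('source_a', 2), ('source_b', 1)]
```

The same group-and-count would translate directly to a Spark DataFrame aggregation once the tables are loaded from the brick.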
