[Preprint]. 2024 Aug 30:arXiv:2408.17320v1.

BioBricks.ai: A Versioned Data Registry for Life Sciences Data Assets

Yifan Gao et al. ArXiv.

Update in

  • BioBricks.ai: a versioned data registry for life sciences data assets.
    Gao Y, Mughal Z, Jaramillo-Villegas JA, Corradi M, Borrel A, Lieberman B, Sharif S, Shaffer J, Fecho K, Chatrath A, Maertens A, Teunis MAT, Kleinstreuer N, Hartung T, Luechtefeld T. Front Artif Intell. 2025 Aug 13;8:1599412. doi: 10.3389/frai.2025.1599412. eCollection 2025. PMID: 40880880. Free PMC article.

Abstract

Researchers in biomedical research, public health, and the life sciences often spend weeks or months discovering, accessing, curating, and integrating data from disparate sources, which significantly delays the start of actual analysis and innovation. Instead of countless developers creating redundant and inconsistent data pipelines, BioBricks.ai offers a centralized data repository and a suite of developer-friendly tools to simplify access to scientific data. Currently, BioBricks.ai delivers over ninety biological and chemical datasets. It provides a package manager-like system for installing and managing dependencies on data sources. Each 'brick' is a Data Version Control (DVC) git repository that supports an updateable pipeline for extracting, transforming, and loading data into the BioBricks.ai backend at https://biobricks.ai. Use cases include accelerating data science workflows and facilitating the creation of novel data assets by integrating multiple datasets into unified, harmonized resources. In conclusion, BioBricks.ai offers an opportunity to accelerate access to and use of public data through a single open platform.
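
The "package manager-like" idea in the abstract can be illustrated with a small sketch. It does not use the BioBricks.ai tooling or reproduce its API. The `Brick` class, the `install_order` function, and every name and version string in `REGISTRY` are hypothetical, chosen only to show how a data asset pinned to a version might declare upstream data assets and how an installer could resolve them, dependencies first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Brick:
    """Hypothetical data asset: a name, a pinned version, and upstream bricks."""
    name: str
    version: str  # stand-in for a git commit hash
    depends_on: tuple = ()


def install_order(registry, target):
    """Return the target brick and its dependencies, dependencies first."""
    ordered = []
    done = set()
    in_progress = set()

    def visit(name):
        if name in done:
            return
        if name in in_progress:
            raise ValueError(f"dependency cycle at {name!r}")
        if name not in registry:
            raise KeyError(f"unknown brick {name!r}")
        in_progress.add(name)
        for dep in registry[name].depends_on:
            visit(dep)
        in_progress.discard(name)
        done.add(name)
        ordered.append(registry[name])

    visit(target)
    return ordered


# Illustrative registry; the names and version strings are made up.
REGISTRY = {
    "source-a": Brick("source-a", "a1b2c3"),
    "source-b": Brick("source-b", "d4e5f6"),
    "harmonized": Brick("harmonized", "0789ab", depends_on=("source-a", "source-b")),
}

if __name__ == "__main__":
    for brick in install_order(REGISTRY, "harmonized"):
        print(f"{brick.name}@{brick.version}")
```

Pinning each dependency to an exact version is what would let a derived dataset be rebuilt reproducibly when an upstream source changes; the depth-first walk above is one simple way to obtain an install order under that assumption.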

Keywords: BioBricks.ai; Bioinformatics; Cheminformatics; Data Integration; Machine Learning; Public Health Data.

Conflict of interest statement

The authors declare the following potential conflicts of interest regarding the research and publication of this paper: BioBricks is a product developed by Insilica LLC, and many of the authors are employees of Insilica LLC. As such, there may be a perceived or real financial interest in the outcomes of the research and the development of BioBricks. The authors affirm that their contributions to the research and the manuscript were conducted with scientific integrity and without bias influenced by their association with Insilica LLC.

Figures

Figure 1
Top left - A code example to install, load, and analyze ToxRefDB data. Bottom left - the result of running the code example. Right - tabular data in bar chart form.
Figure 2
Left - truncated versions of (1) the .bb/dependencies.txt file and (2) the dvc.yaml file in the ChemHarmony BioBrick. Center - the 3-table schema of ChemHarmony, a simple chemical activities dataset with a substances table, a properties table, and an activities table. Right - how to count activities by source by installing the ChemHarmony brick and using it with Apache Spark, with the resulting table at lower right.
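
The "count activities by source" query described in the Figure 2 caption can be sketched without the figure itself. The example below swaps in Python's built-in sqlite3 module for Apache Spark and does not load the ChemHarmony brick. The three table names come from the caption, but every column name and every row is an assumption made for illustration.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with an assumed three-table layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE substances (sid INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE properties (pid INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE activities (
            aid INTEGER PRIMARY KEY,
            sid INTEGER REFERENCES substances(sid),
            pid INTEGER REFERENCES properties(pid),
            source TEXT,
            value REAL
        );
        """
    )
    # Made-up rows; they do not come from any real dataset.
    conn.executemany("INSERT INTO substances VALUES (?, ?)",
                     [(1, "substance-1"), (2, "substance-2")])
    conn.executemany("INSERT INTO properties VALUES (?, ?)",
                     [(1, "property-1")])
    conn.executemany(
        "INSERT INTO activities (sid, pid, source, value) VALUES (?, ?, ?, ?)",
        [(1, 1, "source-a", 0.0), (2, 1, "source-a", 1.0), (1, 1, "source-b", 1.0)],
    )
    return conn


def count_activities_by_source(conn):
    """Return (source, count) pairs, largest count first, ties by source name."""
    query = (
        "SELECT source, COUNT(*) AS n FROM activities "
        "GROUP BY source ORDER BY n DESC, source"
    )
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for source, n in count_activities_by_source(build_demo_db()):
        print(source, n)
```

The same grouping would be a group-by and count on the activities table in Spark; sqlite3 is used here only so the sketch runs with the standard library.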
