Gigascience. 2024 Jan 2;13:giae033.
doi: 10.1093/gigascience/giae033.

PEPhub: a database, web interface, and API for editing, sharing, and validating biological sample metadata

Nathan J LeRoy et al. Gigascience.

Abstract

Background: As biological data increase, we need additional infrastructure to share them and promote interoperability. While major effort has been put into sharing data, relatively less emphasis is placed on sharing metadata. Yet, sharing metadata is also important and in some ways has a wider scope than sharing data themselves.

Results: Here, we present PEPhub, an approach to improve sharing and interoperability of biological metadata. PEPhub provides an API, natural-language search, and user-friendly web-based sharing and editing of sample metadata tables. We used PEPhub to process more than 100,000 published biological research projects and index them with fast semantic natural-language search. PEPhub thus provides a fast and user-friendly way to find existing biological research data or to share new data.

Availability: https://pephub.databio.org.

Keywords: metadata API; metadata machine learning; metadata sharing; metadata validation.

Conflict of interest statement

N.C.S. is a consultant for InVitro Cell Research, LLC. All other authors declare no competing interests.

Figures

Figure 1:
PEPhub high-level architecture and project identification strategy. (A) PEPhub is backed by a Postgres database (left). It interfaces with the PEPhub server through a companion package called pepdbagent (middle). Web requests made by the web client or command-line interface are made via HTTP (right). (B) Workflow for automated GEO-to-PEPhub transfer using GEOfetch. We take advantage of scheduled GitHub Actions to automate new discovery of GEO accessions to upload. (C) PEPhub employs a {namespace}/{project}:{tag} nomenclature for sample table identification. Namespaces contain projects, which can be further distinguished with tags.
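The {namespace}/{project}:{tag} nomenclature in panel C can be illustrated with a small parser. This is a minimal sketch, not PEPhub's own code: the names `RegistryPath` and `parse_registry_path` are hypothetical, the example identifiers are made up, and treating the tag as optional (returning `None` when it is absent) is an assumption based on the caption's statement that tags "further distinguish" projects.

```python
from typing import NamedTuple, Optional


class RegistryPath(NamedTuple):
    """Hypothetical container for the three parts of a PEP identifier."""
    namespace: str
    project: str
    tag: Optional[str]  # assumption: a missing tag is represented as None


def parse_registry_path(path: str) -> RegistryPath:
    """Split '{namespace}/{project}:{tag}' into its components."""
    namespace, slash, rest = path.partition("/")
    if not slash or not namespace or not rest:
        raise ValueError(f"expected 'namespace/project[:tag]', got {path!r}")
    project, _, tag = rest.partition(":")
    if not project:
        raise ValueError(f"missing project name in {path!r}")
    return RegistryPath(namespace, project, tag or None)


if __name__ == "__main__":
    # Made-up identifiers, used only to show the shape of the result.
    print(parse_registry_path("somelab/experiment1:v2"))
    print(parse_registry_path("somelab/experiment1"))
```

Keeping the three parts in a named tuple makes it straightforward to group projects by namespace or to compare two tags of the same project.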
Figure 2:
Metadata sharing, discovery, and accessibility features. (A) PEPhub can convert metadata into JSON, CSV, and TXT output. (B) Using a pretrained sentence transformer, we periodically compute low-dimensional embeddings of all PEPs in PEPhub by mining text descriptions from the metadata. The resulting embeddings are then stored in Qdrant, a vector similarity engine and vector database. These embeddings are then compared against user-submitted queries. (C) Searching for a PEP in PEPhub using vector search happens in 5 steps. First, the user submits a natural-language query. Second, this query is embedded in real time on the server. Third, the resultant vector is used to query Qdrant for nearest neighbors. Fourth, Qdrant responds with the most similar vectors it has stored. Finally, the hits are returned to the client submitting the query.
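The five search steps in panel C can be sketched with standard-library Python. Everything here is a stand-in chosen for illustration: `embed` is a toy bag-of-words function rather than a pretrained sentence transformer, `ToyVectorStore` is an in-memory dictionary rather than Qdrant, and the project identifiers and descriptions are invented. Only the flow (embed the query, find nearest neighbors by similarity, return the hits) follows the caption.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence transformer: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """In-memory stand-in for a vector database (hypothetical, not Qdrant)."""

    def __init__(self) -> None:
        self._vectors = {}

    def upsert(self, key: str, description: str) -> None:
        # Corresponds to the periodic embedding of project descriptions.
        self._vectors[key] = embed(description)

    def search(self, query_vector: Counter, limit: int = 3):
        # Steps 3 and 4: rank stored vectors by similarity to the query.
        scored = sorted(
            ((cosine(query_vector, vec), key) for key, vec in self._vectors.items()),
            reverse=True,
        )
        return [(key, score) for score, key in scored[:limit]]


def search_peps(store: ToyVectorStore, query: str, limit: int = 3):
    """Step 1: receive the query. Step 2: embed it. Step 5: return the hits."""
    return store.search(embed(query), limit=limit)


if __name__ == "__main__":
    store = ToyVectorStore()
    store.upsert("demo/liver:default", "RNA-seq of human liver tissue")
    store.upsert("demo/brain:default", "ATAC-seq of mouse brain")
    store.upsert("demo/yeast:default", "ChIP-seq in yeast")
    for key, score in search_peps(store, "human liver RNA-seq", limit=2):
        print(f"{key}\t{score:.3f}")
```

A learned embedding would also match queries that share meaning but no words with a description, which the word-count stand-in above cannot do.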
Figure 3:
Metadata privacy and validation features. (A) Users have read access to all namespaces but write access only to their own namespaces (left). Other users are not permitted to modify a PEP in any user namespace other than their own. PEPhub implements organizations through GitHub. Members of an organization are automatically granted write access to all PEPs that belong to that organization (right). (B) Validation on PEPhub is made easy with the integration of eido. PEPs in PEPhub can be validated using the web-based validator UI, the metadata builder, or programmatic endpoints.
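The access rules in panel A amount to a short predicate. The sketch below models only what the caption states, and all names in it (`User`, `can_read`, `can_write`, the example users and organization) are hypothetical; it is not how PEPhub or GitHub represents users, and it leaves out anything the caption does not describe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    """Hypothetical user record: a personal namespace plus organization memberships."""
    name: str
    organizations: frozenset = frozenset()


def can_read(user: User, namespace: str) -> bool:
    # Per the caption, users have read access to all namespaces.
    return True


def can_write(user: User, namespace: str) -> bool:
    # Write access is limited to the user's own namespace and to the
    # namespaces of organizations the user belongs to.
    return namespace == user.name or namespace in user.organizations


if __name__ == "__main__":
    alice = User("alice", frozenset({"example-lab"}))
    for ns in ("alice", "example-lab", "bob"):
        print(ns, "read:", can_read(alice, ns), "write:", can_write(alice, ns))
```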
Figure 4:
Metadata management comparison chart. PEPhub compares favorably to alternative metadata management systems. Footnote a: While open source, no clear documentation exists for self-hosting an instance.
