Structured Prompt Interrogation and Recursive Extraction of Semantics (SPIRES): a method for populating knowledge bases using zero-shot learning

Affiliations

¹ Biosystems Data Science, Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States.
² Institute of Data Science, Faculty of Science and Engineering, Maastricht University, 6200 MD Maastricht, The Netherlands.
³ Semanticly, Athens, Greece.
⁴ Robert Bosch LLC, Sunnyvale, CA 94085, United States.
⁵ Department of Biomedical Informatics, University of Colorado, Anschutz Medical Campus, Aurora, CO 80217, United States.
⁶ Berlin Institute of Health at Charité, 10178 Berlin, Germany.

PMID: 38383067
PMCID: PMC10924283
DOI: 10.1093/bioinformatics/btae104

J Harry Caufield et al. Bioinformatics. 2024.

Bioinformatics. 2024 Mar 4;40(3):btae104.

doi: 10.1093/bioinformatics/btae104.

Abstract

Motivation: Creating knowledge bases and ontologies is a time-consuming task that relies on manual curation. AI/NLP approaches can assist expert curators in populating these knowledge bases, but current approaches rely on extensive training data and cannot populate arbitrarily complex nested knowledge schemas.

Results: Here we present Structured Prompt Interrogation and Recursive Extraction of Semantics (SPIRES), a Knowledge Extraction approach that relies on the ability of Large Language Models (LLMs) to perform zero-shot learning and general-purpose query answering from flexible prompts and return information conforming to a specified schema. Given a detailed, user-defined knowledge schema and an input text, SPIRES recursively performs prompt interrogation against an LLM to obtain a set of responses matching the provided schema. SPIRES uses existing ontologies and vocabularies to provide identifiers for matched elements. We present examples of applying SPIRES in different domains, including extraction of food recipes, multi-species cellular signaling pathways, disease treatments, multi-step drug mechanisms, and chemical-to-disease relationships. Current SPIRES accuracy is comparable to the mid-range of existing Relation Extraction methods, but greatly surpasses an LLM's native capability of grounding entities with unique identifiers. SPIRES has the advantage of easy customization, flexibility, and, crucially, the ability to perform new tasks in the absence of any new training data. This method supports a general strategy of leveraging the language-interpreting capabilities of LLMs to assemble knowledge bases, assisting manual knowledge curation and acquisition while supporting validation with publicly available databases and ontologies external to the LLM.

Availability and implementation: SPIRES is available as part of the open source OntoGPT package: https://github.com/monarch-initiative/ontogpt.

Conflict of interest statement

None declared.

Figures

**Figure 1.**
Example schema. Boxes denote classes and arrows denote attributes whose ranges are classes (compound attributes). Crow's feet above boxes denote multivalued attributes. Attributes whose ranges are primitives or value sets are shown within each box. Here, the top-level container class ‘Recipe’ is composed of a label, description, categories, steps, and ingredients. Steps and ingredients are further decomposed into food items, quantities, etc.
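
The nesting described in this caption can be illustrated with plain Python dataclasses. This is only a sketch and not the schema file distributed with OntoGPT: the `Recipe` attribute names come from the caption, while `Step`, `Ingredient`, `Quantity` and all of their field names are hypothetical stand-ins for the "food items, quantities, etc." that the caption mentions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Quantity:
    # Hypothetical fields: a numeric value kept as text, plus a unit label.
    value: Optional[str] = None
    unit: Optional[str] = None


@dataclass
class Ingredient:
    # Hypothetical compound class: one food item and an optional amount.
    food_item: str
    amount: Optional[Quantity] = None


@dataclass
class Step:
    # Hypothetical compound class: an action and the food items it uses.
    action: str
    inputs: List[str] = field(default_factory=list)


@dataclass
class Recipe:
    # Top-level container; attribute names follow the Figure 1 caption.
    label: str
    description: str = ""
    categories: List[str] = field(default_factory=list)    # multivalued primitive
    steps: List[Step] = field(default_factory=list)        # multivalued compound
    ingredients: List[Ingredient] = field(default_factory=list)


recipe = Recipe(
    label="Spaghetti",
    categories=["pasta"],
    ingredients=[Ingredient("onion", Quantity("1", "cup"))],
    steps=[Step("chop", ["onion"])],
)
```

List-typed fields play the role of the multivalued attributes, and fields typed with another dataclass play the role of the compound attributes drawn as arrows.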

**Figure 2.**
Example of a portion of text to parse and a corresponding instantiation of the recipe schema from Fig. 1, using YAML syntax. Input text is truncated for brevity; the full input is available at https://github.com/monarch-initiative/ontogpt/blob/main/tests/input/cases/recipe-spaghetti.txt. In each attribute-value pair, the attribute is shown in bold, followed by a colon and then the value or values. For multivalued attributes, each list element value is indicated with a hyphen at the beginning of the line. Terminal elements that are value sets from ontologies and standards such as FOODON (Dooley *et al*. 2018), UCUM (Schadow *et al*. 1999), and DBpedia (Bizer *et al*. 2009) are shown here with their human-readable labels after the double-hash comment symbol. Dynamic elements are indicated via RDF blank node syntax (e.g. _:ChoppedOnion does not correspond to a named entity and serves as a placeholder).

**Figure 3.**
Overview of the SPIRES approach. A knowledge schema and text containing instances defined in the schema are processed by OntoGPT, yielding a query for GPT-3 or newer, accessed through the OpenAI API. OntoGPT parses the result, grounding extracted instances with specific entries and terms retrieved from queries of databases and ontologies where possible. The final product is a set of structured data (instances and relationships) in the shapes defined by the schema. Icons by user Khoirin from the Noun Project (https://thenounproject.com/besticon/).

**Figure 4.**
Flowchart depicting the SPIRES algorithm.
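
The general idea of recursive, schema-driven extraction with grounding can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the OntoGPT implementation: the schema encoding, the prompt wording, the `key: value` response format, the `EX:` identifiers, and the canned `fake_llm` stand-in (used instead of a real API call) are all hypothetical.

```python
from typing import Any, Callable, Dict

# Hypothetical toy schema: class -> attribute -> (range, multivalued?)
SCHEMA = {
    "Recipe": {"label": ("string", False), "ingredients": ("Ingredient", True)},
    "Ingredient": {"food_item": ("FoodTerm", False), "amount": ("string", False)},
}

# Hypothetical vocabulary used for grounding; the IDs are placeholders.
VOCAB = {"FoodTerm": {"onion": "EX:0001", "spaghetti": "EX:0002"}}


def build_prompt(text: str, cls: str) -> str:
    """List the attributes of `cls` as a fill-in template, then the text."""
    fields = "\n".join(
        f"{name}: <{'semicolon-separated list' if multi else 'value'}>"
        for name, (_, multi) in SCHEMA[cls].items()
    )
    return f"Extract the following fields from the text.\n{fields}\n\nText:\n{text}"


def parse_response(response: str) -> Dict[str, str]:
    """Read 'key: value' lines from the model response."""
    pairs = (line.split(":", 1) for line in response.splitlines() if ":" in line)
    return {key.strip(): value.strip() for key, value in pairs}


def extract(text: str, cls: str, llm: Callable[[str], str]) -> Dict[str, Any]:
    """Query once per class; recurse into attributes whose range is a class."""
    raw = parse_response(llm(build_prompt(text, cls)))
    out: Dict[str, Any] = {}
    for name, (rng, multi) in SCHEMA[cls].items():
        if name not in raw:
            continue
        values = [v.strip() for v in raw[name].split(";") if v.strip()] if multi else [raw[name]]
        if rng in SCHEMA:      # compound attribute: interrogate again on the fragment
            parsed = [extract(v, rng, llm) for v in values]
        elif rng in VOCAB:     # value set: swap the label for an identifier if known
            parsed = [VOCAB[rng].get(v.lower(), v) for v in values]
        else:                  # primitive: keep the string as returned
            parsed = values
        out[name] = parsed if multi else parsed[0]
    return out


def fake_llm(prompt: str) -> str:
    """Canned stand-in for a model call, so the sketch runs offline."""
    text = prompt.split("Text:\n", 1)[1]
    if "ingredients:" in prompt:  # Recipe-level prompt
        return "label: Spaghetti\ningredients: 1 cup onion; 200 g spaghetti"
    amount, food = text.rsplit(" ", 1)  # Ingredient-level prompt
    return f"food_item: {food}\namount: {amount}"


result = extract("Spaghetti: use 1 cup onion and 200 g spaghetti.", "Recipe", fake_llm)
```

In this sketch the identifiers come from a lookup outside the model rather than from the model's own output, which mirrors the abstract's point that grounding relies on existing ontologies and vocabularies.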

References

1. Ateia S, Kruschwitz U. Is ChatGPT a biomedical expert? – exploring the Zero-Shot performance of current GPT models in biomedical tasks. In: CLEF 2023: Conference and Labs of the Evaluation Forum, Thessaloniki, Greece: CLEF Initiative, 2023.
2. Babaei Giglou H, D’Souza J, Auer S. LLMs4OL: large language models for ontology learning. In: The Semantic Web – ISWC 2023. Switzerland: Springer Nature, 2023, 408–27. doi: 10.1007/978-3-031-47240-4
3. Bender EM, Gebru T, McMillan-Major A. et al. On the dangers of stochastic parrots: can language models be too big? In: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, New York, NY, USA: Association for Computing Machinery, 2021, 610–23. ISBN 9781450383097. doi: 10.1145/3442188.3445922
4. Bizer C, Lehmann J, Kobilarov G. et al. DBpedia – a crystallization point for the web of data. J Web Semant 2009;7:154–65. doi: 10.1016/j.websem.2009.07.002
5. Brown EG, Wood L, Wood S. The medical dictionary for regulatory activities (MedDRA). Drug Saf 1999;20:109–17. doi: 10.2165/00002018-199920020-00002
