Mining patents with large language models elucidates the chemical function landscape
- PMID: 38873033
- PMCID: PMC11167698
- DOI: 10.1039/d4dd00011k
Abstract
The fundamental goal of small molecule discovery is to generate chemicals with target functionality. While this often proceeds through structure-based methods, we set out to investigate the practicality of methods that leverage the extensive corpus of chemical literature. We hypothesize that a sufficiently large text-derived chemical function dataset would mirror the actual landscape of chemical functionality. Such a landscape would implicitly capture complex physical and biological interactions given that chemical function arises from both a molecule's structure and its interacting partners. To evaluate this hypothesis, we built a Chemical Function (CheF) dataset of patent-derived functional labels. This dataset, comprising 631 K molecule-function pairs, was created using an LLM- and embedding-based method to obtain 1.5 K unique functional labels for approximately 100 K randomly selected molecules from their corresponding 188 K unique patents. We carry out a series of analyses demonstrating that the CheF dataset contains a semantically coherent textual representation of the functional landscape congruent with chemical structural relationships, thus approximating the actual chemical function landscape. We then demonstrate through several examples that this text-based functional landscape can be leveraged to identify drugs with target functionality using a model able to predict functional profiles from structure alone. We believe that functional label-guided molecular discovery may serve as an alternative approach to traditional structure-based methods in the pursuit of designing novel functional molecules.
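The abstract states that an LLM- and embedding-based method reduced the extracted labels to about 1.5 K unique functional labels, but it does not give the procedure. As a purely illustrative sketch of how embedding similarity could merge near-duplicate labels, the example below greedily maps each label to the first earlier label whose vector is sufficiently close. The function names, the 0.9 threshold, and the toy vectors are all assumptions made for illustration, not the authors' implementation. A real pipeline would obtain the vectors from a text-embedding model.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def consolidate_labels(embeddings, threshold=0.9):
    """Map each label to a representative label.

    Labels are visited in insertion order. A label joins the first
    representative whose embedding has cosine similarity >= threshold;
    otherwise it becomes a new representative. The threshold value is
    an assumption chosen for this toy example.
    """
    representatives = []
    mapping = {}
    for label, vec in embeddings.items():
        for rep in representatives:
            if cosine(vec, embeddings[rep]) >= threshold:
                mapping[label] = rep
                break
        else:
            # No sufficiently similar representative: start a new group.
            representatives.append(label)
            mapping[label] = label
    return mapping


# Hand-made toy vectors (hypothetical, not real embeddings).
toy_embeddings = {
    "antiviral": [1.0, 0.0, 0.1],
    "anti-viral": [0.98, 0.02, 0.12],
    "kinase inhibitor": [0.0, 1.0, 0.0],
}

label_map = consolidate_labels(toy_embeddings)
```

In this toy case the two spellings of "antiviral" collapse into one label while "kinase inhibitor" stays separate. Greedy grouping depends on visiting order, so a clustering method may be preferable at scale.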
This journal is © The Royal Society of Chemistry.
Conflict of interest statement
The authors report no conflict of interest.
Update of
Mining Patents with Large Language Models Elucidates the Chemical Function Landscape. ArXiv [Preprint]. 2023 Dec 18: arXiv:2309.08765v2. Update in: Digit Discov. 2024 May 7;3(6):1150-1159. doi: 10.1039/d4dd00011k. PMID: 38196747.