Mining patents with large language models elucidates the chemical function landscape
- PMID: 38873033
- PMCID: PMC11167698
- DOI: 10.1039/d4dd00011k
Abstract
The fundamental goal of small molecule discovery is to generate chemicals with target functionality. While this often proceeds through structure-based methods, we set out to investigate the practicality of methods that leverage the extensive corpus of chemical literature. We hypothesize that a sufficiently large text-derived chemical function dataset would mirror the actual landscape of chemical functionality. Such a landscape would implicitly capture complex physical and biological interactions given that chemical function arises from both a molecule's structure and its interacting partners. To evaluate this hypothesis, we built a Chemical Function (CheF) dataset of patent-derived functional labels. This dataset, comprising 631 K molecule-function pairs, was created using an LLM- and embedding-based method to obtain 1.5 K unique functional labels for approximately 100 K randomly selected molecules from their corresponding 188 K unique patents. We carry out a series of analyses demonstrating that the CheF dataset contains a semantically coherent textual representation of the functional landscape congruent with chemical structural relationships, thus approximating the actual chemical function landscape. We then demonstrate through several examples that this text-based functional landscape can be leveraged to identify drugs with target functionality using a model able to predict functional profiles from structure alone. We believe that functional label-guided molecular discovery may serve as an alternative approach to traditional structure-based methods in the pursuit of designing novel functional molecules.
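As a rough illustration of what "molecule-function pairs" and a "functional profile" might look like as data, the sketch below turns a handful of pairs into multi-hot label vectors and compares two molecules by label overlap. The toy pairs, the function names `build_profiles` and `jaccard`, and the choice of Jaccard similarity are assumptions made for this example; they are not taken from the CheF dataset or the authors' pipeline.

```python
from collections import defaultdict

# Hypothetical toy data, NOT from the CheF dataset:
# (molecule SMILES, functional label) pairs.
pairs = [
    ("CC(=O)Oc1ccccc1C(=O)O", "anti-inflammatory"),      # aspirin
    ("CC(=O)Oc1ccccc1C(=O)O", "analgesic"),
    ("CN1C=NC2=C1C(=O)N(C(=O)N2C)C", "stimulant"),       # caffeine
    ("CC(C)Cc1ccc(cc1)C(C)C(=O)O", "anti-inflammatory"), # ibuprofen
    ("CC(C)Cc1ccc(cc1)C(C)C(=O)O", "analgesic"),
]


def build_profiles(pairs):
    """Return the sorted label vocabulary and one multi-hot vector per molecule."""
    labels = sorted({label for _, label in pairs})
    index = {label: i for i, label in enumerate(labels)}
    profiles = defaultdict(lambda: [0] * len(labels))
    for smiles, label in pairs:
        profiles[smiles][index[label]] = 1
    return labels, dict(profiles)


def jaccard(a, b):
    """Jaccard similarity between two multi-hot label vectors."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


labels, profiles = build_profiles(pairs)
aspirin = profiles["CC(=O)Oc1ccccc1C(=O)O"]
ibuprofen = profiles["CC(C)Cc1ccc(cc1)C(C)C(=O)O"]
caffeine = profiles["CN1C=NC2=C1C(=O)N(C(=O)N2C)C"]

print(labels)
print(jaccard(aspirin, ibuprofen), jaccard(aspirin, caffeine))
```

In this toy setting, molecules that share functional labels score high on label overlap, which is the kind of text-derived relationship the abstract proposes comparing against structural similarity. A structure-to-function model of the sort described would predict such a vector from the molecule alone.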
This journal is © The Royal Society of Chemistry.
Conflict of interest statement
The authors report no conflict of interest.
Update of
- Mining Patents with Large Language Models Elucidates the Chemical Function Landscape. ArXiv [Preprint]. 2023 Dec 18:arXiv:2309.08765v2. Update in: Digit Discov. 2024 May 7;3(6):1150-1159. doi: 10.1039/d4dd00011k. PMID: 38196747.