Digit Discov. 2024 May 7;3(6):1150-1159.
doi: 10.1039/d4dd00011k. eCollection 2024 Jun 12.

Mining patents with large language models elucidates the chemical function landscape

Clayton W Kosonocky et al. Digit Discov.

Abstract

The fundamental goal of small molecule discovery is to generate chemicals with target functionality. While this often proceeds through structure-based methods, we set out to investigate the practicality of methods that leverage the extensive corpus of chemical literature. We hypothesize that a sufficiently large text-derived chemical function dataset would mirror the actual landscape of chemical functionality. Such a landscape would implicitly capture complex physical and biological interactions given that chemical function arises from both a molecule's structure and its interacting partners. To evaluate this hypothesis, we built a Chemical Function (CheF) dataset of patent-derived functional labels. This dataset, comprising 631 K molecule-function pairs, was created using an LLM- and embedding-based method to obtain 1.5 K unique functional labels for approximately 100 K randomly selected molecules from their corresponding 188 K unique patents. We carry out a series of analyses demonstrating that the CheF dataset contains a semantically coherent textual representation of the functional landscape congruent with chemical structural relationships, thus approximating the actual chemical function landscape. We then demonstrate through several examples that this text-based functional landscape can be leveraged to identify drugs with target functionality using a model able to predict functional profiles from structure alone. We believe that functional label-guided molecular discovery may serve as an alternative approach to traditional structure-based methods in the pursuit of designing novel functional molecules.

Conflict of interest statement

The authors report no conflict of interest.

Figures

Fig. 1. Chemical function dataset creation. LLM extracts molecular functional information present in patents into concise labels; see Fig. S2 for an example. Chemical functional labels were then cleaned with algorithmic-, embedding-, and LLM-based methods.
Fig. 2. Text-based functional labels cluster in structural space. For each of the labels “hcv”, “electroluminescence”, “serotonin”, and “5-HT”, molecules in the CheF dataset were mapped by their molecular fingerprints and colored based on whether the selected label was present in their set of functional descriptors. The max fingerprint Tanimoto similarity was computed between the fingerprint vectors of each molecule containing a given label and was compared against the max fingerprint Tanimoto similarity from a random equal-sized set of molecules to assess significance relative to a random control. Many of the labels strongly cluster in structural space, demonstrating that CheF accurately captures structure–function relationships. See Fig. S5 for examples with more labels.
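The max-similarity comparison in the Fig. 2 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: fingerprints are represented as plain Python sets of on-bit indices, the helper names `tanimoto` and `max_similarities` and the toy molecules are hypothetical, and "max similarity" is read here as each molecule's highest similarity to any other molecule in the same group. The significance test applied to the two resulting distributions is not specified in the caption and is left out.

```python
import random

def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def max_similarities(fps: list) -> list:
    """For each fingerprint, the highest similarity to any other fingerprint in the list."""
    return [
        max(tanimoto(fp, other) for j, other in enumerate(fps) if j != i)
        for i, fp in enumerate(fps)
    ]

# Toy data (hypothetical): molecule id -> set of on bits.
fingerprints = {
    "m1": {1, 2, 3, 4},
    "m2": {1, 2, 3, 5},
    "m3": {2, 3, 4, 6},
    "m4": {10, 11, 12},
    "m5": {20, 21, 22},
    "m6": {30, 31, 32},
}
labeled = ["m1", "m2", "m3"]  # molecules carrying the label of interest

# Random control: an equal-sized set drawn from all molecules.
rng = random.Random(0)
control = rng.sample(sorted(fingerprints), k=len(labeled))

labeled_max = max_similarities([fingerprints[m] for m in labeled])
control_max = max_similarities([fingerprints[m] for m in control])
```

If a label clusters in structural space, the values in `labeled_max` should tend to be higher than those in `control_max`.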
Fig. 3. Label co-occurrences reveal the text-based chemical function landscape. Node sizes correspond to number of connections, and edge sizes correspond to co-occurrence frequency in the CheF dataset. Modularity-based community detection was used to obtain 19 distinct communities. The communities broadly coincided with the semantic meaning of the contained labels, the largest 10 of which were summarized to representative categorical labels (Tables S4–S6†).
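The co-occurrence graph described in the Fig. 3 caption can be approximated with a short sketch. It assumes a hypothetical mapping from molecules to label sets and counts how often each pair of labels appears on the same molecule (edge weight) and how many distinct partners each label has (node size). The function names and toy labels are illustrative only. The community detection step is not reproduced here; a graph library with a modularity-based routine, for example NetworkX's `greedy_modularity_communities`, is one possible way to do it, though the caption does not say which implementation was used.

```python
from collections import Counter
from itertools import combinations

# Hypothetical molecule -> functional label sets.
molecule_labels = {
    "m1": {"hcv", "antiviral", "inhibitor"},
    "m2": {"hcv", "antiviral"},
    "m3": {"serotonin", "5-HT", "antidepressant"},
    "m4": {"serotonin", "5-HT"},
}

def cooccurrence_counts(label_sets):
    """Count how many molecules carry each unordered pair of labels."""
    counts = Counter()
    for labels in label_sets:
        # Sorting makes each pair appear in one canonical order.
        for a, b in combinations(sorted(labels), 2):
            counts[(a, b)] += 1
    return counts

def node_degrees(edge_counts):
    """Number of distinct co-occurring partners per label."""
    degrees = Counter()
    for a, b in edge_counts:
        degrees[a] += 1
        degrees[b] += 1
    return degrees

edges = cooccurrence_counts(molecule_labels.values())
degrees = node_degrees(edges)
```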
Fig. 4. Coherence of the text-based chemical function landscape in structure space. To assess the alignment of text-based functional relationships with structural relationships, for each of the labels “hcv”, “electroluminescence”, “serotonin”, and “5-HT”, the max fingerprint Tanimoto similarity from each molecule containing a given label to each molecule containing any of its 10 most frequently co-occurring labels (<1000 total abundance) was compared against the max fingerprint Tanimoto similarity to a random subset of molecules of the same size. See Fig. S5 for examples with more labels.
Fig. 5. Functional label-guided drug discovery. (a) Test set results from best-performing model that predicts functional labels from molecular fingerprints. Labels sorted by ROC-AUC, showing every 20 labels for clarity. Black line indicates the ROC-AUC random threshold. Average test ROC-AUC and PR-AUC were 0.84 and 0.20, respectively. (b) Model-based comprehensive annotation of chemical function. Shown is a test set molecule patented for hepatitis C antiviral treatment. The highly predicted ‘hcv’, ‘ns’ (nonstructural), and ‘inhibitor’ with the low-predicted ‘protease’ and ‘polymerase’ can be used to infer that the drug acts on NS5A to inhibit HCV replication, revealing a mechanism undisclosed in the patent. (c and d) Functional label-based drug candidate identification, showcasing the top 10 test set molecules for ‘serotonin’ or ‘5-HT’; true positives in green and false positives in red, determined if their associated patents mentioned serotonin or serotonin receptors. The false positives offer potential for drug discovery and repurposing, especially when considering these have patents for related neurological uses (i.e., anti-anxiety and memory dysfunction).
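The two metrics reported in Fig. 5(a) can be illustrated for a single label. The sketch below is not the authors' evaluation code: it computes ROC-AUC as the fraction of positive/negative pairs ranked correctly, and it uses average precision as a stand-in for PR-AUC, which is an assumption since the caption does not say how PR-AUC was estimated. The function names and toy scores are hypothetical.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each positive example."""
    ranked = sorted(zip(scores, y_true), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive example")
    return total / hits

# Toy predictions for one label (hypothetical values).
y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]

auc = roc_auc(y_true, scores)
ap = average_precision(y_true, scores)
```

A random scorer gives a ROC-AUC of 0.5, whereas the random baseline for precision-recall metrics equals the fraction of positives, so the two averages are read against different baselines when labels are sparse.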
