PLoS Comput Biol. 2024 Sep 12;20(9):e1012417.
doi: 10.1371/journal.pcbi.1012417. eCollection 2024 Sep.

An exploration into CTEPH medications: Combining natural language processing, embedding learning, in vitro models, and real-world evidence for drug repurposing

Daniel Steiert et al. PLoS Comput Biol.

Abstract

Background: In the modern era, the growth of scientific literature presents a daunting challenge for researchers to keep informed of advancements across multiple disciplines.

Objective: We apply natural language processing (NLP) and embedding learning concepts to design PubDigest, a tool that combs PubMed literature, aiming to pinpoint potential drugs that could be repurposed.

Methods: Using NLP, especially term associations through word embeddings, we explored unrecognized relationships between drugs and diseases. To illustrate the utility of PubDigest, we focused on chronic thromboembolic pulmonary hypertension (CTEPH), a rare disease with an overall limited number of scientific publications.

Results: Our literature analysis identified key clinical features linked to CTEPH by applying term frequency-inverse document frequency (TF-IDF) scoring, a technique measuring a term's significance in a text corpus. This allowed us to map related diseases. One standout was venous thrombosis (VT), which showed strong semantic links with CTEPH. Looking deeper, we discovered potential repurposing candidates for CTEPH through large-scale neural network-based contextualization of literature and predictive modeling on both the CTEPH and the VT literature corpora to find novel, yet unrecognized associations between the two diseases. Alongside the anti-thrombotic agent caplacizumab, benzofuran derivatives were an intriguing find. In particular, the benzofuran derivative amiodarone displayed potential anti-thrombotic properties in the literature. Our in vitro tests confirmed amiodarone's ability to reduce platelet aggregation significantly by 68% (p = 0.02). However, real-world clinical data indicated that CTEPH patients receiving amiodarone treatment faced a significant 15.9% higher mortality risk (p<0.001).

Conclusions: While NLP offers an innovative approach to interpreting scientific literature, especially for drug repurposing, it is crucial to combine it with complementary methods like in vitro testing and real-world evidence. Our exploration with benzofuran derivatives and CTEPH underscores this point. Thus, blending NLP with hands-on experiments and real-world clinical data can pave the way for faster and safer drug repurposing approaches, especially for rare diseases like CTEPH.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Identifying drug repurposing candidates for CTEPH.
PubDigest aims to automatically identify drug repurposing candidates from scientific literature. Approximately 3,600 abstracts referencing "chronic thromboembolic pulmonary hypertension" (CTEPH), a rare lung disease, were downloaded from the PubMed database. These abstracts were automatically analyzed to identify drugs used for, clinical features associated with, and diseases semantically linked to CTEPH. The CTEPH literature corpus was integrated with the corpus of venous thrombosis (VT), one of the semantically related conditions, consisting of around 82,000 published abstracts. By applying natural language processing (NLP) and embedding learning concepts, such as word embeddings and term associations, drugs from the VT field were identified that might have potential for repurposing in CTEPH. One of these identified candidates was followed up with in vitro and in silico experimentations and real-world data evaluation.
Fig 2. Program schematics of PubDigest, our NLP-centric tool for drug repurposing.
A) Non-technical program flow chart, providing a general overview. B) Detailed program flow chart, illustrating each module’s intricacies. PubDigest’s four modules are executed sequentially: data acquisition, cleaning, and processing; TF-IDF information gain scoring; word2vec term association; and data visualization. A user-supplied “Query Phrase”, in this case, “chronic thromboembolic pulmonary hypertension”, initiates the data acquisition from the PubMed scientific literature meta-data database, engaging the first two modules. The subsequent list of drug compounds and disease terms related to the query are ranked by relevance and frequency. An “Associated Disease” term (here “venous thrombosis”) chosen by the user can then be employed for further exploration of literature associations, triggering the repetition of the data acquisition, cleaning, and processing module and generating a second literature corpus. The third module uses the word2vec model to establish term relationships within the two literature corpora. This model can be queried to measure similarities to a user-defined “Prediction Term” (here “CTEPH”), revealing potential term relationships based on context. Lastly, the results are presented visually by the fourth and final module.
Fig 3. TF-IDF score highlights riociguat’s relevance in CTEPH research.
A) The most notable drugs in CTEPH literature are displayed in this word cloud, with font sizes representing the drug’s corpus-wide term frequency on a linear scale, accentuating the prominence of riociguat. Color and opacity do not convey information but enhance readability. B) The interval-weighted TF-IDF reflects the annual interest accrued by a specific compound within CTEPH research. This score is depicted using pseudo-colors, with each drug’s overall rank based on its corpus-wide TF-IDF score. Notably, in years with no publications mentioning a particular compound, corresponding fields in the visualization are left blank, indicating fading research or clinical interest.
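The TF-IDF scoring referred to in Fig 3 and Fig 4 can be illustrated with a minimal sketch. It assumes the textbook definition (relative term frequency multiplied by log(N / document frequency)) and whitespace tokenization. The interval-weighted variant used by PubDigest is not specified here, so this is not the authors' implementation. The function name `tf_idf` and the toy abstracts are hypothetical.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: score} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical toy "abstracts", not taken from the study corpus.
docs = [
    "riociguat improves outcomes in cteph patients",
    "cteph patients with venous thrombosis history",
    "surgery for cteph patients",
]
scores = tf_idf(docs)
```

In this toy corpus a term that occurs in every document, such as "cteph", scores zero, while a term confined to one document, such as "riociguat", scores highest within that document. This is the sense in which TF-IDF measures a term's significance in a corpus.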
Fig 4. TF-IDF analysis highlights venous thrombosis as a key disease associated with CTEPH.
A) This section displays the annual interest in specific clinical features of CTEPH, as measured by interval-weighted TF-IDF. Features are ranked based on their corpus-wide TF-IDF score. The TF-IDF values are depicted using pseudo-colors. Absence of data (no mentions) for any year leaves the corresponding field blank. B) The top twenty n-grams including a clinical feature identify the most frequently occurring disease terms in the CTEPH literature. The n-grams have been restricted to n = [2,3,4,5] and a frequency > 5.
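The n-gram restriction described for Fig 4B (n from 2 to 5, frequency > 5) could be sketched as follows. This is an illustrative assumption about how such a filter might look: it uses whitespace tokenization and omits the additional requirement that each n-gram include a clinical feature. The function name `count_ngrams` and the example texts are hypothetical.

```python
from collections import Counter


def count_ngrams(texts, sizes=(2, 3, 4, 5), min_count=6):
    """Count word n-grams of the given sizes and keep those seen at least
    min_count times (min_count=6 corresponds to a frequency > 5)."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        for n in sizes:
            # Slide a window of n tokens across the text.
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Hypothetical example texts, repeated to cross the frequency threshold.
texts = ["deep venous thrombosis"] * 6 + ["pulmonary embolism"] * 2
frequent = count_ngrams(texts)
```

Here "pulmonary embolism" appears only twice and is filtered out, while the three n-grams derived from "deep venous thrombosis" each occur six times and are kept.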
Fig 5. Word2Vec reveals new links between venous thrombosis medications and CTEPH.
The model ranks drug compounds in relation to the term CTEPH based on cosine similarity and sorts them into direct and indirect term associations (heatmaps). The direct term associations refer to medications found in the CTEPH literature (right panels, dark blue marks their presence in the corpus). The indirectly associated terms were further processed to identify compounds with known pro- or anti-thrombotic properties. Of the 839 drug compounds contained in the embedding space, 162 direct associations have been identified, with the top 10 shown here. Of the 677 indirect term associations, 23 were labeled to have a thrombotic effect (10 anti-thrombotic and 13 pro-thrombotic), with the top 6 candidates shown per group.
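Ranking terms against a query term by cosine similarity, as described for Fig 5, can be sketched in a few lines. The vectors below are hypothetical three-dimensional stand-ins for word2vec embeddings, and the names `cosine_similarity`, `rank_by_similarity`, and `drug_a` to `drug_c` are illustrative assumptions, not part of PubDigest.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, embeddings):
    """Sort all other terms by descending cosine similarity to the query."""
    q = embeddings[query]
    ranked = [
        (term, cosine_similarity(q, vec))
        for term, vec in embeddings.items()
        if term != query
    ]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings; real word2vec vectors have far more dimensions.
toy = {
    "cteph": [1.0, 0.0, 0.0],
    "drug_a": [0.9, 0.1, 0.0],
    "drug_b": [0.0, 1.0, 0.0],
    "drug_c": [-1.0, 0.0, 0.0],
}
ranking = rank_by_similarity("cteph", toy)
```

A term whose vector points in nearly the same direction as the query ranks first, an orthogonal term scores about zero, and an opposite vector scores about -1. Sorting terms into direct and indirect associations would then depend on whether the term also occurs in the CTEPH corpus, which this sketch does not model.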
Fig 6. Amiodarone was identified as a clinically used potential repurposing candidate for CTEPH.
A) Using benzarone as a lead compound, a structural similarity search identified three analogues, including amiodarone. B) ProTox-II predicted toxicity (probability, %) of these compounds, focusing on several toxicity targets and adverse outcome pathways (AOP), such as peroxisome proliferator activated receptor gamma (PPARG), antioxidant responsive element (ARE), heat shock factor response element (HSE), mitochondrial membrane potential (MtMP), and phosphoprotein (tumor suppressor) p53. C) Platelet adhesion to collagen IV surfaces using healthy donor blood at 20 mL/h perfusion for 5 minutes. Blood was pre-treated with either a vehicle (DMSO) or 5 μM benzarone for 20 minutes, followed by no stimulation or 50 μM ADP stimulation immediately prior to perfusion. D) Repetition of the adhesion assay with 5 μM amiodarone pre-treatment. Utilization of fresh blood from the same donors for C) and D) (n = 3). Platelets visualized by CD42b staining (gray; scale bar = 50 μm). Quantification displayed as percentage of CD42b-positive area relative to total image area, normalized against ADP-stimulated adhesion.
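The quantification described for Fig 6 (CD42b-positive area as a percentage of total image area, normalized against ADP-stimulated adhesion) amounts to simple arithmetic, sketched below. The pixel counts and condition labels are hypothetical and are not data from the study. Expressing the normalized value as a fraction of the ADP-stimulated condition is also an assumption.

```python
def percent_area(positive_pixels, total_pixels):
    """Stained area as a percentage of the total image area."""
    return 100.0 * positive_pixels / total_pixels


def normalize_to_reference(values, reference_key):
    """Divide every value by the value of the reference condition."""
    reference = values[reference_key]
    return {key: value / reference for key, value in values.items()}


# Hypothetical pixel counts for three conditions on a 10,000-pixel image.
areas = {
    "unstimulated": percent_area(500, 10_000),
    "adp": percent_area(2_000, 10_000),
    "adp_plus_compound": percent_area(1_000, 10_000),
}
normalized = normalize_to_reference(areas, "adp")
```

With these made-up numbers the ADP-stimulated condition is 1.0 by construction, and the other conditions are read as fractions of it.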
