[Preprint]. 2024 Sep 16:2024.08.04.24311480.
doi: 10.1101/2024.08.04.24311480.

Foundational model aided automatic high-throughput drug screening using self-controlled cohort study

Shenbo Xu et al. medRxiv.

Abstract

Background: Developing a medicine from scratch through to regulatory authorization, and detecting adverse drug reactions (ADRs), have rarely been economical, expeditious, or low-risk undertakings. The availability of large-scale observational healthcare databases and the popularity of large language models offer an unparalleled opportunity to enable automatic high-throughput drug screening for both repurposing and pharmacovigilance.

Objectives: To demonstrate a general workflow for automatic high-throughput drug screening with the following advantages: (i) the associations of various exposures with diseases can be estimated; (ii) repurposing and pharmacovigilance are integrated; (iii) an accurate exposure length for each prescription is parsed from clinical text; (iv) intrinsic relationships between drugs and diseases are removed jointly by bioinformatic mapping and a large language model (ChatGPT); (v) causal interpretations of incidence rate contrasts are provided.

Methods: Using a self-controlled cohort study design, in which subjects serve as their own controls, we tested the intention-to-treat association between medications and the incidence of diseases. The exposure length of each prescription is determined by parsing common dosage instructions in English free text into a structured format. The exposure period runs from the initial prescription to treatment discontinuation. A period of the same length preceding the initial treatment serves as the control period. Clinical outcomes and categories are identified using existing phenotyping algorithms. Incidence rate ratios (IRRs) are tested using uniformly most powerful (UMP) unbiased tests.
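As a rough illustration of the last step, the sketch below computes an IRR and an exact p-value for a single drug-disease pair. It is not the authors' implementation. It assumes equal person-time in the exposed and control periods, so that under the null hypothesis (IRR = 1) the exposed event count, conditional on the total, follows a Binomial(n, 0.5) distribution. A strictly UMP unbiased test is randomized in general; this sketch uses the non-randomized, equal-tailed version, which is slightly conservative. The function names are hypothetical.

```python
from math import comb


def incidence_rate_ratio(events_exposed, events_control,
                         time_exposed=1.0, time_control=1.0):
    """IRR = (exposed events / exposed time) / (control events / control time)."""
    if events_control == 0:
        # The ratio is undefined or unbounded when no control events occur.
        return float("inf") if events_exposed > 0 else float("nan")
    return (events_exposed / time_exposed) / (events_control / time_control)


def conditional_binomial_p_value(events_exposed, events_control):
    """Two-sided exact p-value for H0: IRR = 1, assuming equal person-time.

    Conditional on the total n, the exposed count is Binomial(n, 0.5)
    under H0, so the smaller tail is doubled (and capped at 1).
    """
    n = events_exposed + events_control
    if n == 0:
        return 1.0  # no events at all: no evidence against H0
    k = min(events_exposed, events_control)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For example, 30 events in the exposed period against 60 in an equally long control period gives an IRR of 0.5, which would point toward a repurposing candidate if the test is significant after whatever multiplicity correction the analysis applies.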

Results: We assessed 3,444 medications against 276 diseases in 6,613,198 patients from the Clinical Practice Research Datalink (CPRD), a UK primary care electronic health record (EHR) database spanning 1987 to 2018. Because of the built-in selection bias of self-controlled cohort studies, ingredient-disease pairs confounded by deterministic medical relationships were removed using existing maps from RxNorm and, where no map existed, by querying ChatGPT. A total of 16,901 drug-disease pairs showed a significant risk reduction and can be considered candidates for repurposing, while 11,089 pairs showed a significant risk increase, where drug safety might instead be a concern.

Conclusions: This work developed a data-driven, nonparametric, hypothesis-generating, and automatic high-throughput workflow that reveals the potential of natural language processing in pharmacoepidemiology. We applied the paradigm to a large observational health dataset to help discover potential novel therapies and adverse drug effects. The framework can be extended to other observational medical databases.

Keywords: drug repurposing; drug screening; incidence rate ratio; natural language processing; pharmacovigilance; self-controlled cohort study.

Figures

Figure 1:
Illustration of the self-controlled cohort study design. The example studies the relationship of one drug-disease pair by incorporating all new users of the drug into the cohort. For each patient, equal person-time is allocated to the exposed period after the initial prescription and to the unexposed period before the first treatment. The disease may occur before the unexposed period starts, during the unexposed period, during the exposed period, after the exposed period ends, or never. This arrangement is replicated for all available medications and possible diseases in the database.
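The five timing cases in the caption can be sketched as a small classifier. This is a hedged illustration only: the function name, the half-open window boundaries, and the choice to place the control window immediately before the first prescription are assumptions, not details confirmed by the source.

```python
from datetime import date


def classify_event(first_rx, discontinuation, event):
    """Place a first disease event relative to the two windows.

    Exposed window: [first_rx, discontinuation).
    Control window: the same number of days directly before first_rx
    (assumed here; the source only says "preceding").
    """
    if event is None:
        return "never"
    length = discontinuation - first_rx      # exposure length as a timedelta
    control_start = first_rx - length        # mirror it backwards in time
    if event < control_start:
        return "before_control"
    if event < first_rx:
        return "control"
    if event < discontinuation:
        return "exposed"
    return "after_exposure"
```

Only events classified as "control" or "exposed" would contribute to the incidence rate contrast for that patient under this reading of the design.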
Figure 2:
Drug indication map from prodcode to medcode. Solid boxes show specific coding systems, while dashed boxes contain the sources of the maps between adjacent coding systems, along with the R packages used for extraction. If no R package is listed in a dashed box, the source of the map is in machine-readable format.
