BMC Bioinformatics. 2011 Oct 18;12 Suppl 10(Suppl 10):S11.
doi: 10.1186/1471-2105-12-S10-S11.

Mining FDA drug labels using an unsupervised learning technique--topic modeling

Halil Bisgin et al. BMC Bioinformatics.

Abstract

Background: Drug labels approved by the Food and Drug Administration (FDA) contain a broad array of information, ranging from adverse drug reactions (ADRs) to drug efficacy, risk-benefit considerations, and more. However, this information is written as free text that often contains ambiguous semantic descriptions, which makes it difficult to retrieve useful information from the labeling text consistently and accurately for comparative analysis across drugs. Consequently, this task has largely relied on experts manually reading the full text, which is time consuming and labor intensive.

Method: In this study, topic modeling, a novel unsupervised text mining method, was applied to drug labeling with the goal of discovering "topics" that group together drugs with similar safety concerns and/or therapeutic uses. A total of 794 FDA-approved drug labels were used. First, three labeling sections (Boxed Warning, Warnings and Precautions, and Adverse Reactions) of each drug label were processed with the Medical Dictionary for Regulatory Activities (MedDRA) to convert the free text of each label to standard ADR terms. Next, topic modeling with latent Dirichlet allocation (LDA) was applied to generate 100 topics, each associated with a set of drugs grouped together on the basis of their topic probabilities. Lastly, the effectiveness of the topic modeling was evaluated against known information about the therapeutic uses and safety data of the drugs.
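
The LDA step in the Method can be illustrated with a minimal sketch. It assumes each drug label has already been reduced to a list of standardized ADR terms, and it uses a small collapsed Gibbs sampler written in plain Python. The drug names, terms, hyperparameters (`alpha`, `beta`), iteration count, and the rule of assigning each drug to its highest-probability topic are all illustrative assumptions; the abstract does not state which LDA implementation, settings, or assignment rule were used.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative, not the authors' code).

    docs: list of term lists, one per drug label.
    Returns theta, where theta[d][k] estimates P(topic k | document d).
    """
    rng = random.Random(seed)
    vocab = sorted({term for doc in docs for term in doc})
    index = {term: i for i, term in enumerate(vocab)}
    n_terms = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]             # term slots per (doc, topic)
    topic_term = [[0] * n_terms for _ in range(n_topics)]  # term counts per topic
    topic_total = [0] * n_topics
    assignments = []

    # Random initial topic for every term occurrence.
    for d, doc in enumerate(docs):
        z_doc = []
        for term in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_term[k][index[term]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, term in enumerate(doc):
                w = index[term]
                k = assignments[d][i]
                # Remove the current assignment before resampling it.
                doc_topic[d][k] -= 1
                topic_term[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_term[j][w] + beta)
                    / (topic_total[j] + n_terms * beta)
                    for j in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_term[k][w] += 1
                topic_total[k] += 1

    # Smoothed per-document topic proportions.
    return [
        [(count + alpha) / (len(doc) + n_topics * alpha) for count in row]
        for row, doc in zip(doc_topic, docs)
    ]

# Hypothetical toy input: drug names and terms are made up for illustration.
labels = {
    "drug_A": ["hepatotoxicity", "jaundice", "hepatic failure"],
    "drug_B": ["jaundice", "hepatotoxicity", "nausea"],
    "drug_C": ["renal failure", "nephrotoxicity", "nausea"],
    "drug_D": ["nephrotoxicity", "renal failure", "proteinuria"],
}
theta = lda_gibbs(list(labels.values()), n_topics=2)

# Assumed assignment rule: each drug goes to its most probable topic.
dominant = {
    drug: max(range(2), key=row.__getitem__)
    for drug, row in zip(labels, theta)
}
```

The study used 794 labels and 100 topics; the toy input above is only meant to show the shape of the data going in (term lists per drug) and coming out (a topic distribution per drug).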

Results: Drugs grouped by topic are associated with the same safety concerns and/or therapeutic uses with statistical significance (P<0.05). The identified topics have distinct contexts that can be directly linked to specific adverse events (e.g., liver injury or kidney injury) or therapeutic applications (e.g., antiinfectives for systemic use). The topics also made it possible to identify potential adverse events that might arise from specific medications.
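
One common way to check whether a topic is enriched for a property such as a Boxed Warning or a therapeutic category is a one-sided hypergeometric test. The sketch below is an assumption for illustration: the abstract reports P<0.05 but does not name the test used, and apart from the 794 labels the counts in the example are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(n_total, n_marked, n_topic, n_topic_marked):
    """One-sided hypergeometric p-value: P(X >= n_topic_marked).

    n_total:        all drugs considered
    n_marked:       drugs with the property (e.g., a Boxed Warning)
    n_topic:        drugs in the topic
    n_topic_marked: drugs in the topic that have the property
    """
    upper = min(n_marked, n_topic)
    denominator = comb(n_total, n_topic)
    # math.comb returns 0 when k > n, so impossible terms drop out.
    tail = sum(
        comb(n_marked, k) * comb(n_total - n_marked, n_topic - k)
        for k in range(n_topic_marked, upper + 1)
    )
    return tail / denominator

# Hypothetical counts: 794 labels in total (from the study), of which we
# assume 200 carry the property; a topic of 20 drugs contains 15 of them.
p_value = hypergeom_enrichment_p(794, 200, 20, 15)
is_significant = p_value < 0.05
```

With 100 topics tested, a correction for multiple comparisons would normally be considered as well; whether the authors applied one is not stated in the abstract.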

Conclusions: The successful application of topic modeling to FDA drug labeling demonstrates its potential as a hypothesis generation tool for inferring hidden relationships between concepts in biomedical documents, such as drug safety and therapeutic use in this study.

Figures

Figure 1
Overview of the workflow. The MedDRA ontology was applied to the three drug labeling sections (Boxed Warning, Warnings and Precautions, and Adverse Reactions) to generate a list of adverse event terms for each drug. Topic modeling was then applied to these lists, followed by statistical analysis to assess the identified topics in the context of safety concerns and therapeutic use.
Figure 2
The distribution of the number of drugs in the 100 topics. Topics containing at least 10 drugs were retained for further analysis; this cutoff is shown on the graph.
Figure 3
The percentage of drugs with a Boxed Warning (BW) in each of the 27 topics that contain at least 10 drugs.
Figure 4
The purity of the top therapeutic category for 27 topics. Each of the 27 topics was assigned to the ATC category that contained the most drugs from that topic; the percentage of the topic's drugs belonging to that category is shown.
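
The per-topic summaries described in the captions for Figures 2 to 4 (the 10-drug cutoff, the percentage of drugs with a Boxed Warning, and the purity of the top ATC category) can be sketched as follows. The function name, data layout, and toy values are hypothetical and serve only to make the definitions concrete.

```python
from collections import Counter

def summarize_topics(topic_drugs, atc_of, has_bw, min_drugs=10):
    """Summarize each topic that has at least `min_drugs` drugs.

    topic_drugs: {topic_id: [drug, ...]}
    atc_of:      {drug: ATC category code}
    has_bw:      {drug: True if the label carries a Boxed Warning}
    """
    summary = {}
    for topic, drugs in topic_drugs.items():
        if len(drugs) < min_drugs:
            continue  # below the cutoff shown in Figure 2
        # Top category = the ATC category with the most drugs in this topic.
        top_category, top_count = Counter(
            atc_of[d] for d in drugs
        ).most_common(1)[0]
        bw_count = sum(1 for d in drugs if has_bw[d])
        summary[topic] = {
            "n_drugs": len(drugs),
            "top_category": top_category,
            "purity_pct": 100 * top_count / len(drugs),
            "bw_pct": 100 * bw_count / len(drugs),
        }
    return summary

# Hypothetical toy data: topic 0 has 10 drugs, topic 1 has only 3.
drugs_0 = [f"drug_{i}" for i in range(10)]
drugs_1 = [f"drug_{i}" for i in range(10, 13)]
atc_of = {d: ("J" if i < 7 else "C") for i, d in enumerate(drugs_0)}
atc_of.update({d: "N" for d in drugs_1})
has_bw = {d: i < 4 for i, d in enumerate(drugs_0)}
has_bw.update({d: False for d in drugs_1})

summary = summarize_topics({0: drugs_0, 1: drugs_1}, atc_of, has_bw)
```

In this toy input, topic 1 falls below the cutoff and is skipped, while topic 0 is assigned to category "J" because 7 of its 10 drugs belong to it.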
