Front Pharmacol. 2020 Nov 10:11:602030.
doi: 10.3389/fphar.2020.602030. eCollection 2020.

A Novel Text-Mining Approach for Retrieving Pharmacogenomics Associations From the Literature

Maria-Theodora Pandi et al. Front Pharmacol. 2020.

Abstract

Text mining of the biomedical literature is an emerging field with demonstrated applications in many research areas, including genetics, personalized medicine, and pharmacogenomics. In this study, we describe a novel text-mining approach for the extraction of pharmacogenomics associations. The code was implemented in the R programming language, through custom scripts where needed and otherwise through functions from existing libraries. Articles (abstracts or full texts) that correspond to a specified query were extracted from PubMed, while concept annotations were derived from PubTator Central. Terms that denote a Mutation or a Gene, as well as Chemical terms corresponding to drug compounds, were normalized, and the sentences containing these terms were filtered and preprocessed to create appropriate training sets. Finally, after training and hyperparameter tuning, four text classifiers were created and evaluated with regard to their performance in identifying pharmacogenomics associations: FastText, linear-kernel SVMs, XGBoost, and Lasso and Elastic-Net Regularized Generalized Linear Models. Although further improvements are essential before this text-mining approach can be properly implemented in clinical practice, our study stands as a comprehensive, simplified, and up-to-date approach for the identification and assessment of research articles enriched in clinically relevant pharmacogenomics relationships. Furthermore, this work highlights a series of challenges concerning the effective application of text mining to the biomedical literature, whose resolution could substantially contribute to the further development of this field.
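
The sentence-filtering step described in the abstract can be illustrated with a minimal sketch. This is not the authors' R code: it is a Python illustration under stated assumptions. The `Annotation` structure, the helper names, the naive punctuation-based sentence splitter, the example text, and the placeholder concept identifiers (such as `gene:CYP2C19`) are all hypothetical and do not reflect the actual PubTator Central output format. The sketch keeps only sentences that mention at least one Gene or Mutation together with at least one Chemical, and enumerates the candidate pairs in each.

```python
import re
from itertools import product
from typing import NamedTuple


class Annotation(NamedTuple):
    start: int       # character offset where the mention begins
    end: int         # character offset one past the end of the mention
    kind: str        # concept type: "Gene", "Mutation" or "Chemical"
    concept_id: str  # normalized identifier (placeholder values below)


def annotate(text, mention, kind, concept_id):
    """Build an Annotation for the first occurrence of `mention` in `text`."""
    start = text.index(mention)
    return Annotation(start, start + len(mention), kind, concept_id)


def split_sentences(text):
    """Return (start, end) character spans from a naive punctuation splitter."""
    spans, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        spans.append((start, match.start() + 1))  # keep the closing punctuation
        start = match.end()
    if start < len(text):
        spans.append((start, len(text)))
    return spans


def candidate_sentences(text, annotations):
    """Keep sentences with at least one Gene/Mutation and one Chemical mention."""
    results = []
    for s_start, s_end in split_sentences(text):
        inside = [a for a in annotations if s_start <= a.start and a.end <= s_end]
        genetic = sorted({a.concept_id for a in inside if a.kind in ("Gene", "Mutation")})
        chemicals = sorted({a.concept_id for a in inside if a.kind == "Chemical"})
        if genetic and chemicals:
            results.append({
                "sentence": text[s_start:s_end],
                "pairs": list(product(genetic, chemicals)),
            })
    return results


# Made-up example text and annotations, for illustration only.
TEXT = (
    "CYP2C19*2 carriers showed reduced response to clopidogrel. "
    "The cohort included 120 patients. "
    "Warfarin dose was associated with VKORC1 and CYP2C9 genotypes."
)
ANNOTATIONS = [
    annotate(TEXT, "CYP2C19", "Gene", "gene:CYP2C19"),
    annotate(TEXT, "clopidogrel", "Chemical", "chem:clopidogrel"),
    annotate(TEXT, "Warfarin", "Chemical", "chem:warfarin"),
    annotate(TEXT, "VKORC1", "Gene", "gene:VKORC1"),
    annotate(TEXT, "CYP2C9", "Gene", "gene:CYP2C9"),
]

for item in candidate_sentences(TEXT, ANNOTATIONS):
    label = "1-pair" if len(item["pairs"]) == 1 else "n-pair"
    print(label, item["pairs"])
```

Counting the pairs per sentence is one plausible way to separate single-pair from multi-pair sentences; how the authors made that split in practice is not stated in the abstract.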

Keywords: FastText; biomedical text classification; supervised learning; PubMed; PubTator; natural language processing; pharmacogenomics associations; text mining.

Figures

FIGURE 1. Flowchart of the proposed automated text-mining approach and the validation steps for the retrieved literature relationships. PMIDs, PubMed identifiers; PMCIDs, PubMed Central identifiers (attributed to full-text articles only).
FIGURE 2. Performance metrics, calculated by 10-fold cross-validation on the training data, for all four models trained with sentences discussing a single Variant-Chemical pair (1-pair sentences).
FIGURE 3. Performance metrics, calculated by 10-fold cross-validation on the training data, for all four models trained with sentences discussing multiple Variant-Chemical pairs (n-pair sentences). Because this is a multiclass classification task, the metrics are presented by model and by class; for each model, the by-class metrics are weighted by the corresponding class prevalence and summed to give the overall performance metrics.
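
The prevalence weighting described in the Figure 3 caption can be sketched as follows. This is a Python illustration, not the authors' R code. The caption does not name the individual metrics, so precision, recall, and F1 are assumed here, and the class labels `A`, `B`, and `C` and the toy predictions are hypothetical. Each by-class metric is multiplied by the share of true examples belonging to that class, and the products are summed.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Compute one-vs-rest precision, recall and F1 for every class label."""
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


def prevalence_weighted(y_true, y_pred):
    """Weight each by-class metric by class prevalence and sum over classes."""
    by_class = per_class_metrics(y_true, y_pred)
    counts = Counter(y_true)  # prevalence is taken from the true labels
    total = len(y_true)
    return {
        name: sum(by_class[c][name] * counts[c] / total for c in by_class)
        for name in ("precision", "recall", "f1")
    }


# Toy labels for illustration only; they are not data from the study.
Y_TRUE = ["A", "A", "A", "A", "B", "B", "B", "B", "C", "C"]
Y_PRED = ["A", "A", "A", "B", "B", "B", "B", "C", "C", "A"]

BY_CLASS = per_class_metrics(Y_TRUE, Y_PRED)
OVERALL = prevalence_weighted(Y_TRUE, Y_PRED)
print(BY_CLASS)
print(OVERALL)
```

One property worth noting: when weights come from the true-label prevalence, the weighted recall equals overall accuracy, which makes a convenient sanity check on any implementation of this scheme.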

