TwiMed: Twitter and PubMed Comparable Corpus of Drugs, Diseases, Symptoms, and Their Relations

Nestor Alvaro¹,², Yusuke Miyao¹,², Nigel Collier³

Affiliations

¹ National Institute of Informatics, Department of Informatics, Tokyo, Japan.
² The Graduate University for Advanced Studies (SOKENDAI), Kanagawa, Japan.
³ Faculty of Modern & Medieval Languages, Department of Theoretical and Applied Linguistics, University of Cambridge, Cambridge, United Kingdom.

PMID: 28468748
PMCID: PMC5438461
DOI: 10.2196/publichealth.6396

Nestor Alvaro et al. JMIR Public Health Surveill. 2017 May 3;3(2):e24. doi: 10.2196/publichealth.6396.

Abstract

Background: Work on pharmacovigilance systems using texts from PubMed and Twitter typically targets different elements and uses different annotation guidelines. As a result, there is no comparable set of documents from both Twitter and PubMed annotated in the same manner.

Objective: This study aimed to provide a comparable corpus of texts from PubMed and Twitter that can be used to study drug reports from these two sources of information, allowing researchers in the area of pharmacovigilance using natural language processing (NLP) to perform experiments to better understand the similarities and differences between drug reports in Twitter and PubMed.

Methods: We produced a corpus comprising 1000 tweets and 1000 PubMed sentences selected using the same strategy and annotated at entity level by the same experts (pharmacists) using the same set of guidelines.

Results: The resulting corpus, annotated by two pharmacists, comprises semantically correct annotations for a set of drugs, diseases, and symptoms. This corpus contains the annotations for 3144 entities, 2749 relations, and 5003 attributes.

Conclusions: We present a corpus that is unique in its characteristics: it is the first corpus for pharmacovigilance curated from Twitter messages and PubMed sentences using the same data selection and annotation strategies. We believe this corpus will be of particular interest to researchers wishing to compare results from pharmacovigilance systems (eg, classifiers and named entity recognition systems) when using data from Twitter and from PubMed. We hope that, given the comprehensive set of drug names and the annotated entities and relations, this corpus will become a standard resource for comparing results from different pharmacovigilance studies in the area of NLP.

Keywords: PubMed; Twitter; annotation; corpus; natural language processing; pharmacovigilance; text mining.

©Nestor Alvaro, Yusuke Miyao, Nigel Collier. Originally published in JMIR Public Health and Surveillance (http://publichealth.jmir.org), 03.05.2017.

Conflict of interest statement

Conflicts of Interest: None declared.

Figures

**Figure 1**
Annotation pipeline. The initial number of raw sentences differed between Twitter (165,489 tweets) and PubMed (29,435 sentences).

**Figure 2**
Sample with the annotation of a drug, a disease and the relation between these concepts in a sentence from Twitter.

**Figure 3**
Sample of an annotation where “duration” and “exemplification” attributes are used.
