J Med Internet Res. 2020 Feb 26;22(2):e15861.

doi: 10.2196/15861.

Promoting Reproducible Research for Characterizing Nonmedical Use of Medications Through Data Annotation: Description of a Twitter Corpus and Guidelines

Karen O'Connor^#¹, Abeed Sarker^#², Jeanmarie Perrone³, Graciela Gonzalez Hernandez¹

Affiliations

¹ Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States.
² Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA, United States.
³ Department of Emergency Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States.

^# Contributed equally.

PMID: 32130117
PMCID: PMC7066507
DOI: 10.2196/15861


Abstract

Background: Social media data are increasingly used for population-level health research because they provide near real-time access to large volumes of consumer-generated data. Several recent studies have explored using social media data, such as data from Twitter, to monitor prescription medication abuse. However, there is a paucity of annotated data and of data characterization guidelines that describe how information related to abuse-prone medications is presented on Twitter.

Objective: This study discusses the creation of an annotated corpus suitable for training supervised classification algorithms for the automatic classification of medication abuse-related chatter. The annotation strategies used for improving interannotator agreement (IAA), a detailed annotation guideline, and machine learning experiments that illustrate the utility of the annotated corpus are also described.

Methods: We employed an iterative annotation strategy, with interannotator discussions held and updates made to the annotation guidelines at each iteration to improve IAA for the manual annotation task. Using the grounded theory approach, we first characterized tweets into fine-grained categories and then grouped them into 4 broad classes: abuse or misuse, personal consumption, mention, and unrelated. After the completion of manual annotations, we experimented with several machine learning algorithms to illustrate the utility of the corpus and generate baseline performance metrics for automatic classification on these data.
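As an illustration of the IAA measure named in this abstract (Cohen kappa), the following is a minimal sketch of how agreement between two annotators over the 4 broad classes could be computed. The function name `cohen_kappa` and the toy label lists are hypothetical and are not taken from the corpus; the authors' own tooling is not described here.

```python
from collections import Counter

# The 4 broad classes named in the Methods.
CLASSES = ("abuse or misuse", "personal consumption", "mention", "unrelated")


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical toy annotations (8 tweets), for illustration only.
annotator_1 = ["abuse or misuse", "personal consumption", "mention", "unrelated",
               "mention", "abuse or misuse", "mention", "unrelated"]
annotator_2 = ["abuse or misuse", "personal consumption", "mention", "unrelated",
               "personal consumption", "abuse or misuse", "mention", "mention"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"Cohen kappa on toy data: {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when class frequencies are skewed, as they typically are in social media corpora.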

Results: Our final annotated set consisted of 16,443 tweets mentioning at least 20 abuse-prone medications including opioids, benzodiazepines, atypical antipsychotics, central nervous system stimulants, and gamma-aminobutyric acid analogs. Our final overall IAA was 0.86 (Cohen kappa), which represents high agreement. The manual annotation process revealed the variety of ways in which prescription medication misuse or abuse is discussed on Twitter, including expressions indicating coingestion, nonmedical use, nonstandard route of intake, and consumption above the prescribed doses. Among machine learning classifiers, support vector machines obtained the highest automatic classification accuracy of 73.00% (95% CI 71.4-74.5) over the test set (n=3271).
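The confidence interval reported for classification accuracy can be approximated with a standard binomial interval. The sketch below uses the Wilson score interval; this is an assumption, since the abstract does not state which interval method the authors used. The count of correctly classified tweets is also not given in the abstract, so it is back-calculated here from the reported accuracy and test set size.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("require 0 <= correct <= total and total > 0")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width


TEST_SET_SIZE = 3271  # n reported in the Results
# Assumption: number of correct predictions inferred from the reported 73.00%.
ASSUMED_CORRECT = round(0.73 * TEST_SET_SIZE)

lower, upper = wilson_interval(ASSUMED_CORRECT, TEST_SET_SIZE)
print(f"accuracy = {ASSUMED_CORRECT / TEST_SET_SIZE:.4f}, "
      f"95% CI = ({lower:.4f}, {upper:.4f})")
```

Under these assumptions the interval lands close to the reported range; small differences in the last digit would be expected if the authors used a different method or rounding.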

Conclusions: Our manual analysis and annotations of a large number of tweets have revealed types of information posted on Twitter about a set of abuse-prone prescription medications and their distributions. In the interests of reproducible and community-driven research, we have made our detailed annotation guidelines and the training data for the classification experiments publicly available, and the test data will be used in future shared tasks.

Keywords: infodemiology; infoveillance; machine learning; natural language processing; prescription drug misuse; social media; substance abuse detection.

©Karen O'Connor, Abeed Sarker, Jeanmarie Perrone, Graciela Gonzalez Hernandez. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 26.02.2020.

Conflict of interest statement

Conflicts of Interest: None declared.

Figures

**Figure 1**
Overview of the creation of the annotation guideline and the iterative annotation process.

**Figure 2**
Distribution of tweets in the annotated corpus by annotation category and drug class.


Grants and funding

R01 DA046619/DA/NIDA NIH HHS/United States