Alcohol Clin Exp Res. 2022 May;46(5):836-847.
doi: 10.1111/acer.14807. Epub 2022 May 16.

Using Facebook language to predict and describe excessive alcohol use

Rupa Jose et al. Alcohol Clin Exp Res. 2022 May.

Abstract

Background: Assessing risk for excessive alcohol use is important for applications ranging from recruitment into research studies to targeted public health messaging. Social media language provides an ecologically embedded source of information for assessing individuals who may be at risk for harmful drinking.

Methods: Using data collected on 3664 respondents from the general population, we examine how accurately language used on social media classifies individuals as at-risk for alcohol problems based on Alcohol Use Disorder Identification Test-Consumption score benchmarks.

Results: We find that social media language is moderately accurate (area under the curve = 0.75) at identifying individuals at risk for alcohol problems (i.e., hazardous drinking/alcohol use disorders) when used with models based on contextual word embeddings. High-risk alcohol use was predicted by individuals' usage of words related to alcohol, partying, informal expressions, swearing, and anger. Low-risk alcohol use was predicted by individuals' usage of social, affiliative, and faith-based words.

Conclusions: The use of social media data to study drinking behavior in the general public is promising and could eventually support primary and secondary prevention efforts among Americans whose at-risk drinking may have otherwise gone "under the radar."

Keywords: excessive alcohol use; natural language processing; social media; subclinical drinking.
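
The area under the curve (AUC) reported in the Results can be read as the probability that a randomly chosen at-risk respondent receives a higher model score than a randomly chosen low-risk respondent. The sketch below illustrates that rank-based definition only; it is not the authors' pipeline, and the labels and scores are made-up toy values rather than study data.

```python
def roc_auc(labels, scores):
    """Rank-based AUC: share of (positive, negative) pairs in which the
    positive case gets the higher score. Tied pairs count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = at-risk by the screening benchmark, 0 = low risk.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]  # hypothetical model scores
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.2f}")
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking, which is why a value of 0.75 is described as moderately accurate.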

Figures

Figure 1. AUDIT-C ROC curves
ROC curve for the best-performing AUDIT-C classification model (the contextual embeddings model) in dark blue. For comparison, the ROC curve using the 10-item Cohen’s perceived stress measure is shown in light blue. A standard reference line is shown by the solid vertical gray line.
Figure 2. General word clouds comparing high-risk and low-risk drinkers
Word clouds showing the most correlated and frequent words and phrases used by individuals who are at risk for AUDs/hazardous drinking (i.e., high-risk drinkers) and those who engage in low-risk drinking. Font size indicates correlation strength (i.e., larger words are more strongly correlated with the drinking outcome), whereas font color indicates word frequency: high-frequency words are in red, moderate-frequency words are in blue, and low-frequency words are in gray. Word clouds were generated using a frequency occurrence filter set at 0.1 (keeping only the words/phrases that occur at least 10% of the time), a pointwise mutual information (PMI) threshold of 3.0 (which filters phrases/multigram features based on how commonly they appear), and by selecting only individuals whose posts contained at least 1,000 words. This yielded the above word clouds, which were based on 10,904 language features collected from a total of 3,392 individuals.
Figure 3. Top LIWC categories and LDA topics for high-risk and low-risk drinkers
Top five LIWC categories and LDA topics associated with at-risk AUDs/hazardous drinking (i.e., high-risk drinking; blue clouds) or low-risk drinking behaviors (red clouds). Font size indicates the relative prevalence of the word within the category or topic. Categories and topics are presented in descending order (i.e., strongest correlations first). At the bottom of each column, the correlation range and p-values for the presented categories/topics are noted. If the correlation magnitude was identical between two stacked categories or topics, they were marked with an asterisk (*). LIWC categories were based on 73 features and LDA topics on 2,000 features (all estimated on the message data of 3,392 individuals). Controlling for age and gender did not result in substantive changes in any of the LDA topics and resulted in only one LIWC category shift for “Low Risk Drinking.” That is, with the additional controls, LIWC’s function category (including words like “the”, “to”, “I”, “and”, “a”, and “you”) nudged out affiliation as a top five category, although affiliation remained statistically significant (p = 0.0021).
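
The Figure 2 caption mentions a pointwise mutual information (PMI) filter for multi-word phrases. As a rough illustration of the idea, the sketch below scores two-word phrases with the textbook definition PMI(x, y) = log(p(x, y) / (p(x) p(y))) and keeps those at or above a threshold. The function names, the natural-log base, the simple count-based probability estimates, and the toy sentence are all assumptions made for illustration; the authors' software may estimate and scale PMI differently, so the threshold used here is not comparable to the 3.0 in the caption.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return {(w1, w2): PMI} for adjacent word pairs, using relative
    frequencies as probability estimates (an illustrative assumption)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    pmi = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        pmi[(w1, w2)] = math.log(p_xy / (p_x * p_y))
    return pmi


def filter_by_pmi(pmi, threshold):
    """Keep only phrases whose PMI meets the threshold."""
    return {phrase for phrase, value in pmi.items() if value >= threshold}


# Hypothetical toy text, not study data.
tokens = "happy hour with friends at happy hour".split()
pmi = bigram_pmi(tokens)
kept = filter_by_pmi(pmi, threshold=2.0)  # threshold chosen for the toy data
for phrase in sorted(kept):
    print(" ".join(phrase), round(pmi[phrase], 2))
```

Raw PMI tends to favor pairs made of rare words, which may be one reason such a filter is paired with a minimum-frequency requirement like the one the caption describes.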
