Review
Soc Netw Anal Min. 2021;11(1):102.
doi: 10.1007/s13278-021-00825-0. Epub 2021 Oct 20.

An augmented multilingual Twitter dataset for studying the COVID-19 infodemic

Christian E Lopez et al. Soc Netw Anal Min. 2021.

Abstract

This work presents an openly available dataset to facilitate researchers' exploration and hypothesis testing about the social discourse of the COVID-19 pandemic. The dataset currently consists of over 2.2 billion tweets (as of September 2021) from all over the world, in multiple languages. Tweets start from January 22, 2020, when the total number of reported COVID-19 cases worldwide was below 600. The dataset was collected using the Twitter API and by rehydrating tweets from other available datasets; data collection is ongoing as of the time of writing. To facilitate hypothesis testing and exploration of social discourse, the English and Spanish tweets have been augmented with state-of-the-art Twitter Sentiment and Named Entity Recognition algorithms. The dataset and the summary files provided allow researchers to avoid some computationally intensive analyses, facilitating more widespread use of social media data to gain insights on issues such as (mis)information diffusion, semantic networks, sentiments, and the evolution of COVID-19 discussions. In addition, the dataset provides an archive for researchers in the social sciences who wish to have access to a dataset covering the entire duration of the pandemic.
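
As an illustration of the kind of analysis the sentiment-augmented tweets could support, the following minimal sketch computes the daily proportion of tweets per sentiment label. The row layout (a date string paired with a label), the label names, and the function name `daily_sentiment_proportions` are assumptions made for this example; they are not the dataset's documented schema.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (tweet date, sentiment label). The field layout and the
# label set are assumptions for illustration, not the dataset's actual schema.
SAMPLE_ROWS = [
    ("2020-01-22", "negative"),
    ("2020-01-22", "neutral"),
    ("2020-01-22", "negative"),
    ("2020-01-23", "positive"),
    ("2020-01-23", "neutral"),
]


def daily_sentiment_proportions(rows):
    """Return {date: {label: share of that day's tweets with that label}}."""
    counts = defaultdict(Counter)
    for date, label in rows:
        counts[date][label] += 1

    proportions = {}
    for date, label_counts in counts.items():
        total = sum(label_counts.values())
        proportions[date] = {
            label: n / total for label, n in label_counts.items()
        }
    return proportions


if __name__ == "__main__":
    result = daily_sentiment_proportions(SAMPLE_ROWS)
    for date in sorted(result):
        print(date, result[date])
```

Because only per-day label counts are held in memory, the same approach would work when streaming rows from a large file rather than from an in-memory list.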

Keywords: COVID-19; Named Entity Recognition; Sentiment analysis; Twitter.

Conflict of interest statement

We have no competing interests to declare.

Figures

Fig. 1. Example of Tweet Related to COVID-19
Fig. 2. Example of dataset tables
Fig. 3. Tweet frequency across top five observed languages
Fig. 4. Map of tweets featuring geolocation information
Fig. 5. Sentiment of English-language tweets
Fig. 6. Daily proportion of English-language tweets sentiment
Fig. 7. Sentiment of Spanish tweets
Fig. 8. Daily proportion of Spanish-language tweets by sentiment
Fig. 9. Network generated from English tweets augmented dataset
