An augmented multilingual Twitter dataset for studying the COVID-19 infodemic
- PMID: 34697560
- PMCID: PMC8528187
- DOI: 10.1007/s13278-021-00825-0
Abstract
This work presents an openly available dataset to facilitate researchers' exploration and hypothesis testing about the social discourse of the COVID-19 pandemic. The dataset currently consists of over 2.2 billion tweets (count as of September 2021) from all over the world, in multiple languages. The tweets start from January 22, 2020, when the total number of reported COVID-19 cases was below 600 worldwide. The dataset was collected using the Twitter API and by rehydrating tweets from other available datasets; data collection is ongoing as of the time of writing. To facilitate hypothesis testing and exploration of social discourse, the English and Spanish tweets have been augmented with state-of-the-art Twitter Sentiment and Named Entity Recognition algorithms. The dataset and the accompanying summary files allow researchers to avoid some computationally intensive analyses, facilitating more widespread use of social media data to gain insights on issues such as (mis)information diffusion, semantic networks, sentiments, and the evolution of COVID-19 discussions. In addition, the dataset provides an archive for researchers in the social sciences who wish to have access to a dataset covering the entire duration of the pandemic.
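Rehydration, as mentioned above, means retrieving full tweet objects from a list of shared tweet IDs. The abstract does not describe the authors' pipeline, so the following is only a minimal sketch of one preparatory step: cleaning a list of IDs and splitting it into request-sized batches. The function name `batch_tweet_ids` is hypothetical, and the batch size of 100 is an assumption about the per-request limit of a tweet lookup endpoint that should be checked against the current API documentation. No network calls are made.

```python
from typing import Iterable, Iterator, List

# Assumption: the lookup endpoint accepts at most this many IDs per request.
BATCH_SIZE = 100


def batch_tweet_ids(
    tweet_ids: Iterable[str], batch_size: int = BATCH_SIZE
) -> Iterator[List[str]]:
    """Yield lists of at most `batch_size` unique, numeric tweet IDs.

    Hypothetical helper for illustration; input order is preserved,
    and blank, non-numeric, or repeated IDs are skipped.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    seen = set()
    batch: List[str] = []
    for raw in tweet_ids:
        tweet_id = raw.strip()
        # Skip malformed entries and IDs already queued in an earlier batch.
        if not tweet_id.isdigit() or tweet_id in seen:
            continue
        seen.add(tweet_id)
        batch.append(tweet_id)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Emit the final, possibly shorter, batch.
    if batch:
        yield batch
```

Each yielded batch could then be passed to whichever lookup client a researcher uses; deduplicating first avoids spending rate-limited requests on IDs that appear in more than one source dataset.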
Keywords: COVID-19; Named Entity Recognition; Sentiment analysis; Twitter.
© The Author(s), under exclusive licence to Springer-Verlag GmbH Austria, part of Springer Nature 2021.
Conflict of interest statement
We have no competing interests to declare.