TClustVID: A novel machine learning classification model to investigate topics and sentiment in COVID-19 tweets

Md Shahriare Satu¹, Md Imran Khan², Mufti Mahmud³, Shahadat Uddin⁴, Matthew A Summers⁵, Julian M W Quinn⁵, Mohammad Ali Moni⁵,⁶

Affiliations

¹ Department of Management Information Systems, Noakhali Science & Technology University, Noakhali, 3814, Bangladesh.
² Department of Computer Science & Engineering, Gono Bishwabidyalay, Savar, Dhaka, 1344, Bangladesh.
³ Department of Computer Science, and Medical Technology Innovation Facility, Nottingham Trent University, Clifton Campus, Clifton, Nottingham - NG11 8NS, UK.
⁴ Complex Systems Research Group, Faculty of Engineering, The University of Sydney, Darlington, NSW 2008, Australia.
⁵ The Garvan Institute of Medical Research, Healthy Ageing Theme, Darlinghurst, NSW 2010, Australia.
⁶ WHO Collaborating Centre on eHealth, UNSW Digital Health, School of Public Health and Community Medicine, Faculty of Medicine, University of New South Wales, Sydney, NSW 2052, Australia.

PMID: 33972817
PMCID: PMC8099549
DOI: 10.1016/j.knosys.2021.107126

Md Shahriare Satu et al. Knowl Based Syst. 2021 Aug 17:226:107126. doi: 10.1016/j.knosys.2021.107126. Epub 2021 May 6.

Abstract

COVID-19, caused by SARS-CoV-2 infection, varies greatly in severity but can present with serious respiratory symptoms and with vascular and other complications, particularly in older adults. The disease can be spread by both symptomatic and asymptomatic infected individuals. Uncertainty remains over key aspects of the virus's infectiousness (particularly for newly emerging variants), and the disease has had severe economic impacts globally. For these reasons, COVID-19 is the subject of intense and widespread discussion on social media platforms, including Facebook and Twitter. These public forums substantially influence public opinion and in some cases can exacerbate the panic and misinformation that spread during the crisis. This work therefore aimed to design an intelligent clustering-based classification and topic extraction model, named TClustVID, that analyzes COVID-19-related public tweets to extract significant sentiments with high accuracy. We gathered COVID-19 Twitter datasets from the IEEE Dataport repository, applied a range of data preprocessing methods to clean the raw data, then tokenized the text and produced a word-to-index dictionary. Different classifiers were then applied to these datasets, which allowed us to compare the performance of traditional classification with that of TClustVID. Our analysis found that TClustVID showed higher performance than the traditional methodologies, as determined by clustering criteria. Finally, we extracted significant topics from the clusters, split them into positive, neutral and negative sentiments, and identified the most frequent topics using the proposed model. This approach can rapidly identify prevailing public opinions and attitudes related to COVID-19 and infection prevention strategies that are spreading among different populations.

Keywords: COVID-19; Classification; Machine learning; TClustVID; Topics modeling; Twitter data.
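
The abstract mentions cleaning the raw tweets, tokenizing them and building a word-to-index dictionary, but does not give the exact rules used. The following is a minimal sketch of what such a step could look like. The cleaning rules, the function names (`clean_tweet`, `tokenize`, `build_word_index`, `encode`), the example tweets and the choice of index 0 for unknown tokens are all assumptions made for illustration, not the authors' implementation.

```python
import re

def clean_tweet(text):
    """Lowercase a tweet and strip URLs, @mentions and non-letter characters.
    These cleaning rules are illustrative assumptions."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop links
    text = re.sub(r"@\w+", " ", text)          # drop user mentions
    text = re.sub(r"[^a-z\s]", " ", text)      # keep letters and whitespace only
    return text

def tokenize(text):
    """Clean the text, then split it on whitespace."""
    return clean_tweet(text).split()

def build_word_index(tweets):
    """Map each distinct token to an integer in order of first appearance.
    Index 0 is reserved for unknown tokens (an assumption)."""
    word_index = {}
    for tweet in tweets:
        for token in tokenize(tweet):
            if token not in word_index:
                word_index[token] = len(word_index) + 1
    return word_index

def encode(tweet, word_index):
    """Turn a tweet into a list of integer ids; unseen tokens map to 0."""
    return [word_index.get(token, 0) for token in tokenize(tweet)]

# Hypothetical example tweets, not taken from the IEEE Dataport datasets.
tweets = [
    "Stay home, stay safe! https://example.org @who",
    "Vaccines are safe #COVID19",
]
word_index = build_word_index(tweets)
print(word_index)
print(encode("Stay safe from COVID", word_index))
```

The integer sequences produced by `encode` are the kind of input that a downstream classifier or clustering step can consume after padding or vectorization.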

Conflict of interest statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figures

**Fig. 1**
Details of the working methodology: A. data preprocessing; B. traditional classification and evaluation; C. clustering, classification and evaluation; D. comparison of the outcomes of traditional classification and TClustVID; E. selection of the best clusters/datasets and identification of positive, neutral and negative clusters; F. topic extraction by LDA and presentation of the most frequent topics.

**Fig. 2**
Average performance of various classifiers evaluated using (a) the traditional approach and (b) TClustVID on the nine experimental Twitter datasets.

**Fig. 3**
SHAP values computed to determine COVID-19 (a) positive, (b) neutral and (c) negative topics.

**Fig. 4**
Word cloud of various topics.

**Fig. 5**
Positive topics of Cluster-3.

**Fig. 7**
Negative topics of Cluster-3.

**Fig. 8**
Most frequent (a) positive, (b) neutral and (c) negative COVID-19-associated topics.
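
Stages B to E of Fig. 1 compare classification on a whole dataset with classification inside clusters, and then pick the best clusters. The record does not say which clustering algorithm or classifiers were used, so the sketch below substitutes plain k-means and a leave-one-out 1-nearest-neighbour classifier as stand-ins. The two-dimensional points, the labels and all function names are hypothetical, and the code is meant only to illustrate the cluster-then-classify workflow, not to reproduce TClustVID.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means. The first k points seed the centroids,
    which keeps the sketch deterministic (an illustrative choice)."""
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        assign = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def loo_1nn_accuracy(points, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier."""
    if len(points) < 2:
        return None  # not enough data to evaluate
    hits = 0
    for i, p in enumerate(points):
        others = (j for j in range(len(points)) if j != i)
        nearest = min(others, key=lambda j: sq_dist(p, points[j]))
        hits += labels[nearest] == labels[i]
    return hits / len(points)

def evaluate_by_cluster(points, labels, k):
    """Cluster the data, then score the classifier inside each cluster."""
    assign = kmeans(points, k)
    scores = {}
    for c in range(k):
        idx = [i for i, a in enumerate(assign) if a == c]
        scores[c] = loo_1nn_accuracy([points[i] for i in idx],
                                     [labels[i] for i in idx])
    return scores

# Hypothetical 2-D tweet features with sentiment labels.
points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0),
          (10.0, 11.0), (1.0, 0.0), (11.0, 10.0)]
labels = ["pos", "neg", "pos", "neg", "neu", "neg"]

baseline = loo_1nn_accuracy(points, labels)         # whole-dataset evaluation
per_cluster = evaluate_by_cluster(points, labels, 2)  # per-cluster evaluation
best_cluster = max(per_cluster, key=per_cluster.get)  # highest-scoring cluster
print(baseline, per_cluster, best_cluster)
```

Scoring each cluster separately makes it possible to rank clusters by classifier performance, which is the kind of selection that stage E of Fig. 1 describes before topics are extracted.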
