Multi-label emotion classification of Urdu tweets
- PMID: 35494831
- PMCID: PMC9044368
- DOI: 10.7717/peerj-cs.896
Abstract
Urdu is widely used in South Asia and around the world. Although similar datasets exist for English, we created the first multi-label emotion dataset in the Urdu Nastalíq script, consisting of 6,043 tweets and covering six basic emotions. We adopted a multi-label (ML) classification approach to detect emotions in Urdu text. The morphological and syntactic structure of Urdu makes multi-label emotion detection a challenging problem. In this paper, we build a set of baseline classifiers: machine learning algorithms (Random Forest (RF), Decision Tree (J48), Sequential Minimal Optimization (SMO), AdaBoostM1, and Bagging), deep learning algorithms (1D Convolutional Neural Network (1D-CNN), Long Short-Term Memory (LSTM), and LSTM with CNN features), and a transformer-based baseline (BERT). We used a combination of text representations: stylometric features, pre-trained word embeddings, word-based n-grams, and character-based n-grams. The paper describes the annotation guidelines and dataset characteristics, and offers insights into the different methodologies used for Urdu-based emotion classification. We report our best results for all tested methods using micro-averaged F1, macro-averaged F1, accuracy, Hamming loss (HL), and exact match (EM).
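To make the evaluation measures named above concrete, the following is a minimal sketch of how Hamming loss, exact match, micro-averaged F1, and macro-averaged F1 are commonly computed for multi-label predictions. It is not the authors' code: the function names and the toy label matrices are assumptions introduced for illustration, and the six columns stand in for the six emotion labels without naming them. Accuracy is omitted because its multi-label definition varies and the abstract does not specify one.

```python
from typing import List, Tuple

Labels = List[List[int]]  # one row per tweet, one 0/1 column per emotion label


def hamming_loss(y_true: Labels, y_pred: Labels) -> float:
    """Fraction of individual label decisions that are wrong (lower is better)."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr))
    return wrong / total


def exact_match(y_true: Labels, y_pred: Labels) -> float:
    """Fraction of tweets whose full predicted label set equals the gold set."""
    hits = sum(list(t) == list(p) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


def _counts(y_true: Labels, y_pred: Labels, label: int) -> Tuple[int, int, int]:
    """True positives, false positives, false negatives for one label column."""
    tp = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if t[label] and p[label]:
            tp += 1
        elif p[label]:
            fp += 1
        elif t[label]:
            fn += 1
    return tp, fp, fn


def _f1(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(y_true: Labels, y_pred: Labels) -> float:
    """Pool counts over all labels, then compute a single F1."""
    tp = fp = fn = 0
    for label in range(len(y_true[0])):
        a, b, c = _counts(y_true, y_pred, label)
        tp, fp, fn = tp + a, fp + b, fn + c
    return _f1(tp, fp, fn)


def macro_f1(y_true: Labels, y_pred: Labels) -> float:
    """Compute F1 per label, then take the unweighted mean."""
    n_labels = len(y_true[0])
    return sum(_f1(*_counts(y_true, y_pred, l)) for l in range(n_labels)) / n_labels


# Toy data (hypothetical): 4 tweets, 6 emotion labels.
y_true = [[1, 0, 0, 0, 1, 0], [0, 1, 0, 0, 0, 0], [0, 0, 0, 1, 0, 1], [0, 0, 1, 0, 0, 0]]
y_pred = [[1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0], [0, 0, 0, 1, 0, 1], [0, 0, 0, 0, 1, 0]]

print("Hamming loss:", hamming_loss(y_true, y_pred))
print("Exact match: ", exact_match(y_true, y_pred))
print("Micro F1:    ", round(micro_f1(y_true, y_pred), 4))
print("Macro F1:    ", round(macro_f1(y_true, y_pred), 4))
```

In this toy example, macro F1 is lower than micro F1 because two labels are never predicted correctly. Each of those labels contributes a per-label F1 of zero to the unweighted mean, whereas micro averaging pools the counts across all labels.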
Keywords: Deep learning; Emotion classification in Urdu; Emotion detection; Machine learning; Multi-label emotion detection; Natural language processing.
© 2022 Ashraf et al.
Conflict of interest statement
The authors declare that they have no competing interests.