PLoS One. 2024 Dec 5;19(12):e0307186.

doi: 10.1371/journal.pone.0307186. eCollection 2024.

An automated approach to identify sarcasm in low-resource language

Shumaila Khan¹, Iqbal Qasim¹, Wahab Khan¹, Aurangzeb Khan¹, Javed Ali Khan², Ayman Qahmash³, Yazeed Yasin Ghadi⁴

Affiliations

¹ Institute of CS & IT, University of Science & Technology, Bannu, Pakistan.
² Department of Computer Science, School of Physics, Engineering & Computer Science, University of Hertfordshire, Hatfield, United Kingdom.
³ Department of Informatics and Computer Systems, King Khalid University, Abha, Saudi Arabia.
⁴ Department of Computer Science, Al Ain University, Al Ain, UAE.

PMID: 39637015
PMCID: PMC11620596
DOI: 10.1371/journal.pone.0307186

Shumaila Khan et al. PLoS One. 2024.

Abstract

Sarcasm detection has emerged as an active research area because of its applicability in natural language processing (NLP), but it remains largely unexplored in low-resource languages such as Urdu, Arabic, Pashto, and Roman-Urdu. Most existing work targets English, and only a few studies address low-resource languages. This research addresses the gap by exploring the efficacy of diverse machine learning (ML) algorithms in identifying sarcasm in Urdu. The scarcity of annotated datasets for low-resource languages is a major challenge. To overcome it, we curated and released a comparatively large dataset, the Urdu Sarcastic Tweets (UST) dataset, comprising user-generated comments from X (formerly Twitter). Automatic sarcasm detection in text uses computational methods to determine whether a given statement is intended to be sarcastic. The task is difficult because sarcasm depends on the user's behavior, attitude, and expression of emotions. We therefore employ several baseline ML classifiers and evaluate their effectiveness in detecting sarcasm in low-resource languages. The primary models evaluated in this study are support vector machine (SVM), decision tree (DT), K-nearest neighbor (K-NN), logistic regression (LR), random forest (RF), Naïve Bayes (NB), and XGBoost. We validated the performance of these classifiers on two distinct datasets: the Tanz-Indicator dataset and the UST dataset. The SVM classifier consistently outperformed the other ML models, with an accuracy of 0.85 across various experimental setups. This research underscores the importance of tailoring sarcasm detection approaches to the specific linguistic characteristics of low-resource languages, paving the way for future investigations. By providing open access to the UST dataset, we encourage its use as a benchmark for sarcasm detection research in similar linguistic contexts.
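To make the baseline-classifier setup more concrete, the following is a minimal sketch of one of the listed model families, a multinomial Naïve Bayes text classifier with add-one smoothing, written in plain Python. It is not the authors' implementation: the toy English sentences, the whitespace tokenizer, and the names `TRAIN`, `tokenize`, `train_nb`, `predict_nb`, and `accuracy` are all illustrative assumptions, and none of the data comes from the UST or Tanz-Indicator datasets.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data (NOT taken from the UST or Tanz-Indicator datasets).
# Label 1 = sarcastic, label 0 = not sarcastic.
TRAIN = [
    ("oh great another monday", 1),
    ("wow what a great traffic jam", 1),
    ("oh wow love waiting in line", 1),
    ("the weather is nice today", 0),
    ("i enjoyed the match today", 0),
    ("the food was nice", 0),
]


def tokenize(text):
    # Assumption: plain whitespace tokenization. Real Urdu text would need
    # script-aware normalization and tokenization.
    return text.lower().split()


def train_nb(examples):
    """Collect per-class document counts, per-class word counts, and the vocabulary."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_docs[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return class_docs, word_counts, vocab


def predict_nb(model, text):
    """Return the label with the highest log-posterior under add-one smoothing."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        score = math.log(class_docs[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # ignore words never seen in training
            score += math.log(
                (word_counts[label][tok] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, examples):
    """Fraction of examples whose predicted label matches the gold label."""
    correct = sum(predict_nb(model, text) == label for text, label in examples)
    return correct / len(examples)


model = train_nb(TRAIN)
```

Calling `predict_nb(model, "oh wow great")` classifies a new sentence, and `accuracy(model, held_out)` scores the model on a held-out list of `(text, label)` pairs. A comparison like the one described in the abstract would repeat this train-and-score loop for each classifier on each dataset.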

Copyright: © 2024 Khan et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Fig 1. Overview of the proposed research methodology.**

**Fig 3. ROC curve showing the performance of ML classifiers.**

**Fig 4. Learning curves to train the machine learning classifier.**

**Fig 5. Confusion matrices generated for the ML classifiers: 4a for logistic regression and 4b for random forest.**

**Fig 6. Performance comparison of classifiers on different datasets.**
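
As a reading aid for confusion-matrix figures such as Fig 5, the sketch below shows how accuracy, precision, recall, and F1 are derived from the four cells of a binary confusion matrix. The function name `binary_metrics` and the counts in the example call are hypothetical and are not values reported in the paper.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute standard metrics from binary confusion-matrix counts.

    tp: sarcastic items predicted sarcastic
    fp: non-sarcastic items predicted sarcastic
    fn: sarcastic items predicted non-sarcastic
    tn: non-sarcastic items predicted non-sarcastic
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts, chosen only for illustration.
metrics = binary_metrics(tp=40, fp=10, fn=10, tn=40)
```

Accuracy alone can hide an imbalance between the two error types, which is why precision and recall are usually read alongside it.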
