TARGE: large language model-powered explainable hate speech detection
- PMID: 40567784
- PMCID: PMC12192871
- DOI: 10.7717/peerj-cs.2911
Abstract
The proliferation of user-generated content on social networking sites has intensified the challenge of accurately and efficiently detecting inflammatory and discriminatory speech at scale. Traditional manual moderation methods are impractical due to the sheer volume and complexity of online discourse, necessitating automated solutions. However, existing deep learning models for hate speech detection typically function as black-box systems, providing binary classifications without interpretable insights into their decision-making processes. This opacity significantly limits their practical utility, particularly in nuanced content moderation tasks. To address this challenge, our research explores leveraging the advanced reasoning and knowledge integration capabilities of state-of-the-art language models, specifically Mistral-7B, to develop transparent hate speech detection systems. We introduce a novel framework wherein large language models (LLMs) generate explicit rationales by identifying and analyzing critical textual features indicative of hate speech. These rationales are subsequently integrated into specialized classifiers designed to perform explainable content moderation. We rigorously evaluate our methodology on multiple benchmark English-language social media datasets. Results demonstrate that incorporating LLM-generated explanations significantly enhances both the interpretability and accuracy of hate speech detection. This approach not only identifies problematic content effectively but also clearly articulates the analytical rationale behind each decision, fulfilling the critical demand for transparency in automated content moderation.
Keywords: Hate speech; Large language models; Rationale extraction; Social media.
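The abstract outlines a two-stage design: an LLM (Mistral-7B) first articulates a rationale naming the textual features indicative of hate speech, and that rationale is then passed to a specialized classifier. The following is a minimal Python sketch of such a pipeline, assuming the Hugging Face transformers library, the public mistralai/Mistral-7B-Instruct-v0.2 checkpoint as the rationale generator, and an off-the-shelf hate-speech model as a stand-in for the paper's own classifier; the prompt wording and model names are illustrative assumptions, not the authors' implementation.

```python
# Sketch of a rationale-then-classify pipeline (illustrative only).
# Assumptions not taken from the paper: the Mistral-7B-Instruct checkpoint,
# the prompt template, and the placeholder classifier model.
from transformers import pipeline

RATIONALE_PROMPT = (
    "Analyze the following social media post. List the textual features "
    "(slurs, dehumanizing language, targeted groups) that indicate hate "
    "speech, or state that none are present.\n\nPost: {post}\n\nRationale:"
)

def generate_rationale(post: str, generator) -> str:
    """Stage 1: ask the LLM to explain why the post may be hateful."""
    prompt = RATIONALE_PROMPT.format(post=post)
    out = generator(prompt, max_new_tokens=128, do_sample=False)
    # The pipeline returns the prompt plus the completion; keep only the completion.
    return out[0]["generated_text"][len(prompt):].strip()

def classify_with_rationale(post: str, rationale: str, classifier) -> dict:
    """Stage 2: feed the post together with its rationale to a classifier."""
    combined = f"{post} [SEP] {rationale}"
    return classifier(combined)[0]

if __name__ == "__main__":
    # Rationale generator: any instruction-tuned causal LM can stand in here.
    generator = pipeline(
        "text-generation", model="mistralai/Mistral-7B-Instruct-v0.2"
    )
    # Placeholder classifier; the paper trains its own explainable classifier.
    classifier = pipeline(
        "text-classification", model="cardiffnlp/twitter-roberta-base-hate"
    )
    post = "Example user post to moderate."
    rationale = generate_rationale(post, generator)
    print(rationale)
    print(classify_with_rationale(post, rationale, classifier))
```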
© 2025 Hashir et al.
Conflict of interest statement
The authors declare that they have no competing interests.