PeerJ Comput Sci. 2025 May 30;11:e2911. doi: 10.7717/peerj-cs.2911. eCollection 2025.

TARGE: large language model-powered explainable hate speech detection

Muhammad Haseeb Hashir et al. PeerJ Comput Sci. 2025.

Abstract

The proliferation of user-generated content on social networking sites has intensified the challenge of accurately and efficiently detecting inflammatory and discriminatory speech at scale. Traditional manual moderation methods are impractical due to the sheer volume and complexity of online discourse, necessitating automated solutions. However, existing deep learning models for hate speech detection typically function as black-box systems, providing binary classifications without interpretable insights into their decision-making processes. This opacity significantly limits their practical utility, particularly in nuanced content moderation tasks. To address this challenge, our research explores leveraging the advanced reasoning and knowledge integration capabilities of state-of-the-art language models, specifically Mistral-7B, to develop transparent hate speech detection systems. We introduce a novel framework wherein large language models (LLMs) generate explicit rationales by identifying and analyzing critical textual features indicative of hate speech. These rationales are subsequently integrated into specialized classifiers designed to perform explainable content moderation. We rigorously evaluate our methodology on multiple benchmark English-language social media datasets. Results demonstrate that incorporating LLM-generated explanations significantly enhances both the interpretability and accuracy of hate speech detection. This approach not only identifies problematic content effectively but also clearly articulates the analytical rationale behind each decision, fulfilling the critical demand for transparency in automated content moderation.
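
As a rough illustration of the two-stage pipeline the abstract describes (an LLM first writes a rationale, which a specialized classifier then consumes alongside the post), the following is a minimal Python sketch. The checkpoint names, prompt wording, and "[SEP]" fusion strategy are illustrative assumptions, not the authors' exact implementation.

    # Illustrative two-stage sketch: an LLM writes a rationale, then a
    # fine-tuned classifier consumes the post together with that rationale.
    # Checkpoint names, the prompt, and the "[SEP]" fusion are assumptions.
    from transformers import pipeline

    rationale_llm = pipeline(
        "text-generation",
        model="mistralai/Mistral-7B-Instruct-v0.2",  # assumed Mistral-7B variant
    )
    classifier = pipeline(
        "text-classification",
        model="path/to/rationale-aware-classifier",  # hypothetical fine-tuned model
    )

    def moderate(post: str) -> dict:
        prompt = (
            "Identify the words or phrases in the post below that indicate "
            f"hate speech, and briefly explain why.\nPost: {post}\nRationale:"
        )
        generated = rationale_llm(prompt, max_new_tokens=128, do_sample=False)
        rationale = generated[0]["generated_text"][len(prompt):].strip()
        verdict = classifier(f"{post} [SEP] {rationale}")[0]
        return {
            "label": verdict["label"],
            "score": verdict["score"],
            "rationale": rationale,
        }

Returning the rationale alongside the label is what makes the verdict auditable: a moderator can check whether the cited phrases actually support the classification.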

Keywords: Hate speech; Large language models; Rationale extraction; Social media.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Figure 1. Task prompt.
Figure 2. Task re-prompt.
Figure 3. Proposed framework architecture.
Figure 4. Integrated gradients (IG) visualization of the proposed framework’s performance on the GAB dataset.
Figure 5. Integrated gradients (IG) visualization of the proposed framework’s performance on the Twitter dataset.
Figure 6. Integrated gradients (IG) visualization of the proposed framework’s performance on the ETHOS dataset.
Figure 7. Mistral-7B one-shot hate speech detection prompt and response.
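
Figures 4 to 6 show integrated-gradients (IG) token attributions. As background on the technique, here is a minimal sketch of computing token-level IG scores for a transformer classifier, assuming the Captum library and a hypothetical fine-tuned checkpoint; this is one common way to produce such visualizations, not necessarily the authors' tooling.

    # Illustrative token-attribution sketch with Captum's integrated gradients.
    # The checkpoint path and target class index are assumptions.
    import torch
    from captum.attr import LayerIntegratedGradients
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    name = "path/to/rationale-aware-classifier"  # hypothetical fine-tuned model
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name).eval()

    def forward(input_ids, attention_mask):
        return model(input_ids, attention_mask=attention_mask).logits

    enc = tokenizer("an example post to attribute", return_tensors="pt")
    baseline = torch.full_like(enc["input_ids"], tokenizer.pad_token_id)

    lig = LayerIntegratedGradients(forward, model.get_input_embeddings())
    attributions = lig.attribute(
        enc["input_ids"],
        baselines=baseline,
        additional_forward_args=(enc["attention_mask"],),
        target=1,  # assumed index of the "hate" class
    )
    # One attribution score per input token (summed over embedding dims);
    # positive scores indicate tokens pushing toward the target class.
    scores = attributions.sum(dim=-1).squeeze(0)
    for token, score in zip(
        tokenizer.convert_ids_to_tokens(enc["input_ids"][0]), scores.tolist()
    ):
        print(f"{token}\t{score:+.3f}")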
