PeerJ Comput Sci. 2025 May 30;11:e2911. doi: 10.7717/peerj-cs.2911. eCollection 2025.

TARGE: large language model-powered explainable hate speech detection

Muhammad Haseeb Hashir et al. PeerJ Comput Sci. 2025.

Abstract

The proliferation of user-generated content on social networking sites has intensified the challenge of accurately and efficiently detecting inflammatory and discriminatory speech at scale. Traditional manual moderation methods are impractical due to the sheer volume and complexity of online discourse, necessitating automated solutions. However, existing deep learning models for hate speech detection typically function as black-box systems, providing binary classifications without interpretable insights into their decision-making processes. This opacity significantly limits their practical utility, particularly in nuanced content moderation tasks. To address this challenge, our research explores leveraging the advanced reasoning and knowledge integration capabilities of state-of-the-art language models, specifically Mistral-7B, to develop transparent hate speech detection systems. We introduce a novel framework wherein large language models (LLMs) generate explicit rationales by identifying and analyzing critical textual features indicative of hate speech. These rationales are subsequently integrated into specialized classifiers designed to perform explainable content moderation. We rigorously evaluate our methodology on multiple benchmark English-language social media datasets. Results demonstrate that incorporating LLM-generated explanations significantly enhances both the interpretability and accuracy of hate speech detection. This approach not only identifies problematic content effectively but also clearly articulates the analytical rationale behind each decision, fulfilling the critical demand for transparency in automated content moderation.
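
As a rough illustration of the two-stage pipeline the abstract describes (an LLM first writes a rationale, which a specialized classifier then consumes alongside the post), the following is a minimal Python sketch. The checkpoint names, prompt wording, and "[SEP]" fusion strategy are illustrative assumptions, not the authors' exact implementation.

    # Illustrative two-stage sketch: an LLM writes a rationale, then a
    # fine-tuned classifier consumes the post together with that rationale.
    # Checkpoint names, the prompt, and the "[SEP]" fusion are assumptions.
    from transformers import pipeline

    rationale_llm = pipeline(
        "text-generation",
        model="mistralai/Mistral-7B-Instruct-v0.2",  # assumed Mistral-7B variant
    )
    classifier = pipeline(
        "text-classification",
        model="path/to/rationale-aware-classifier",  # hypothetical fine-tuned model
    )

    def moderate(post: str) -> dict:
        prompt = (
            "Identify the words or phrases in the post below that indicate "
            f"hate speech, and briefly explain why.\nPost: {post}\nRationale:"
        )
        generated = rationale_llm(prompt, max_new_tokens=128, do_sample=False)
        rationale = generated[0]["generated_text"][len(prompt):].strip()
        verdict = classifier(f"{post} [SEP] {rationale}")[0]
        return {
            "label": verdict["label"],
            "score": verdict["score"],
            "rationale": rationale,
        }

Returning the rationale alongside the label is what makes the verdict auditable: a moderator can check whether the cited phrases actually support the classification.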

Keywords: Hate speech; Large language models; Rationale extraction; Social media.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Figure 1. Task prompt.
Figure 2. Task re-prompt.
Figure 3. Proposed framework architecture.
Figure 4. Integrated gradients (IG) visualization of the proposed framework’s performance on the GAB dataset.
Figure 5. Integrated gradients (IG) visualization of the proposed framework’s performance on the Twitter dataset.
Figure 6. Integrated gradients (IG) visualization of the proposed framework’s performance on the ETHOS dataset.
Figure 7. Mistral-7B one-shot hate speech detection prompt and response.
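
Figures 4 to 6 show integrated-gradients (IG) token attributions. As background on the technique, here is a minimal sketch of computing token-level IG scores for a transformer classifier, assuming the Captum library and a hypothetical fine-tuned checkpoint; this is one common way to produce such visualizations, not necessarily the authors' tooling.

    # Illustrative token-attribution sketch with Captum's integrated gradients.
    # The checkpoint path and target class index are assumptions.
    import torch
    from captum.attr import LayerIntegratedGradients
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    name = "path/to/rationale-aware-classifier"  # hypothetical fine-tuned model
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name).eval()

    def forward(input_ids, attention_mask):
        return model(input_ids, attention_mask=attention_mask).logits

    enc = tokenizer("an example post to attribute", return_tensors="pt")
    baseline = torch.full_like(enc["input_ids"], tokenizer.pad_token_id)

    lig = LayerIntegratedGradients(forward, model.get_input_embeddings())
    attributions = lig.attribute(
        enc["input_ids"],
        baselines=baseline,
        additional_forward_args=(enc["attention_mask"],),
        target=1,  # assumed index of the "hate" class
    )
    # One attribution score per input token (summed over embedding dims);
    # positive scores indicate tokens pushing toward the target class.
    scores = attributions.sum(dim=-1).squeeze(0)
    for token, score in zip(
        tokenizer.convert_ids_to_tokens(enc["input_ids"][0]), scores.tolist()
    ):
        print(f"{token}\t{score:+.3f}")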
