PeerJ Comput Sci. 2022 Aug 18:8:e1059.

doi: 10.7717/peerj-cs.1059. eCollection 2022.

Investigating toxicity changes of cross-community redditors from 2 billion posts and comments

Hind Almerekhi¹, Haewoon Kwak², Bernard J Jansen³

Affiliations

¹ Hamad Bin Khalifa University, Doha, Qatar.
² Singapore Management University, Singapore, Singapore.
³ Qatar Computing Research Institute, HBKU, Doha, Qatar.

PMID: 36092019
PMCID: PMC9455283
DOI: 10.7717/peerj-cs.1059


Hind Almerekhi et al. PeerJ Comput Sci. 2022.


Abstract

This research investigates changes in online behavior of users who publish in multiple communities on Reddit by measuring their toxicity at two levels. With the aid of crowdsourcing, we built a labeled dataset of 10,083 Reddit comments, then used the dataset to train and fine-tune a Bidirectional Encoder Representations from Transformers (BERT) neural network model. The model predicted the toxicity levels of 87,376,912 posts from 577,835 users and 2,205,581,786 comments from 890,913 users on Reddit over 16 years, from 2005 to 2020. This study utilized the toxicity levels of user content to identify toxicity changes by the user within the same community, across multiple communities, and over time. As for the toxicity detection performance, the BERT model achieved a 91.27% classification accuracy and an area under the receiver operating characteristic curve (AUC) score of 0.963 and outperformed several baseline machine learning and neural network models. The user behavior toxicity analysis showed that 16.11% of users publish toxic posts, and 13.28% of users publish toxic comments. However, results showed that 30.68% of users publishing posts and 81.67% of users publishing comments exhibit changes in their toxicity across different communities, indicating that users adapt their behavior to the communities' norms. Furthermore, time series analysis with the Granger causality test of the volume of links and toxicity in user content showed that toxic comments are Granger caused by links in comments.
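
The idea of a user's toxicity changing across communities can be illustrated with a small sketch. This is not the authors' pipeline: it assumes each post or comment already carries a binary toxic/non-toxic label (for example, from a classifier), and it assumes a user "changes" when their majority label differs between at least two subreddits. The function names, the 0.5 threshold, and the sample records are all hypothetical.

```python
from collections import defaultdict


def toxicity_by_community(records):
    """Map user -> {subreddit -> fraction of that user's items labeled toxic}.

    `records` is an iterable of (user, subreddit, is_toxic) tuples.
    """
    # counts[user][subreddit] = [toxic_count, total_count]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for user, subreddit, is_toxic in records:
        cell = counts[user][subreddit]
        cell[0] += int(is_toxic)
        cell[1] += 1
    return {
        user: {sub: toxic / total for sub, (toxic, total) in subs.items()}
        for user, subs in counts.items()
    }


def users_with_toxicity_change(records, threshold=0.5):
    """Return cross-community users whose majority label differs by subreddit.

    Assumption: a community counts as "toxic" for a user when the toxic
    fraction is at least `threshold`; the paper may define change differently.
    """
    changed = set()
    for user, fractions in toxicity_by_community(records).items():
        if len(fractions) < 2:
            continue  # single-community users cannot change across communities
        labels = {fraction >= threshold for fraction in fractions.values()}
        if len(labels) == 2:  # both toxic and non-toxic communities observed
            changed.add(user)
    return changed


# Made-up example records, for illustration only.
records = [
    ("alice", "r/science", False),
    ("alice", "r/science", False),
    ("alice", "r/gaming", True),
    ("bob", "r/science", False),
    ("bob", "r/news", False),
    ("carol", "r/gaming", True),
]
print(users_with_toxicity_change(records))  # prints {'alice'}
```

In this toy data, only "alice" is flagged: "bob" is non-toxic in both of his communities, and "carol" posts in a single community.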

Keywords: Machine learning; Online communities; Online hate; Posting behavior; Reddit; Toxicity.


Conflict of interest statement

The authors declare there are no competing interests.

Figures

**Figure 1. A Reddit post from the subreddit “r/science” with its associated discussion threads.**

**Figure 2. Cumulative distribution function of the participating subreddits per user in (A) posts and (B) comments.**

**Figure 3. The count of toxicity changes over time in posting users from condition 1 (NT → T), condition 2 (T → NT), and both conditions combined.**

**Figure 4. The count of toxicity changes over time in commenting users from condition 1 (NT → T), condition 2 (T → NT), and both conditions combined.**

**Figure 5. (A–D) Heatmap plots of the Δ in user posts and comments over two pairs of years.**
The dark color in the heatmap plot denotes scattered deltas, while the light colors denote deltas concentrated in specific locations.

**Figure 6. The total amount of Δ in posting users' content over time with an interpolation of Δ averages across three-year intervals.**

**Figure 7. The total amount of Δ in commenting users' content over time with an interpolation of Δ averages across three-year intervals.**

**Figure 8. The total number of posts, toxic posts, and links in every year followed by the normalized totals using the min-max scale.**

**Figure 9. The distribution of internal and external links, followed by the total number of known media types and image links from the posts collection.**

**Figure 10. The total number of comments, toxic comments, and links in every year followed by the normalized totals using the min-max scale.**

**Figure 11. The distribution of internal and external links, followed by the total number of known media types and image links from the comments collection.**

**Figure 12. Correlation between the total number of participating subreddits over time and (A) the total number of toxic posts and (B) the total number of toxic comments.**

**Figure B1. The labeling task instructions that we provided to crowd workers.**

**Figure B2. The validation questions that crowd workers had to pass before beginning labeling.**
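
The captions of Figures 8 and 10 mention totals normalized with the min-max scale. As a minimal sketch of that rescaling, assuming the standard definition x' = (x − min) / (max − min), the hypothetical helper below maps a series of yearly totals onto [0, 1]. The sample numbers are made up and are not taken from the paper.

```python
def min_max_scale(values):
    """Rescale numbers to [0, 1] with x' = (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant series has no spread; map every value to 0.0 by convention.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up yearly totals, for illustration only.
yearly_totals = [120, 480, 960, 300]
scaled = min_max_scale(yearly_totals)
print(scaled)
```

Scaling each series separately puts counts of very different magnitude (for example, all posts versus toxic posts) on a shared axis, so their trends can be compared by shape.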


Cited by

Special issue on analysis and mining of social media data.
Zubiaga A, Rosso P. PeerJ Comput Sci. 2024 Feb 29;10:e1909. doi: 10.7717/peerj-cs.1909. eCollection 2024. PMID: 38435569

Bibliometric Analysis of Granger Causality Studies.
Lam WS, Lam WH, Jaaman SH, Lee PF. Entropy (Basel). 2023 Apr 7;25(4):632. doi: 10.3390/e25040632. PMID: 37190420

Tracking patterns in toxicity and antisocial behavior over user lifetimes on large social media platforms.
Blumer K, Kleinberg J. Sci Rep. 2025 Jul 14;15(1):25369. doi: 10.1038/s41598-025-07086-3. PMID: 40659657

