Investigating toxicity changes of cross-community redditors from 2 billion posts and comments
- PMID: 36092019
- PMCID: PMC9455283
- DOI: 10.7717/peerj-cs.1059
Abstract
This research investigates changes in the online behavior of users who publish in multiple communities on Reddit by measuring their toxicity at two levels. With the aid of crowdsourcing, we built a labeled dataset of 10,083 Reddit comments, then used the dataset to train and fine-tune a Bidirectional Encoder Representations from Transformers (BERT) neural network model. The model predicted the toxicity levels of 87,376,912 posts from 577,835 users and 2,205,581,786 comments from 890,913 users on Reddit over 16 years, from 2005 to 2020. We used the toxicity levels of user content to identify changes in a user's toxicity within the same community, across multiple communities, and over time. In terms of toxicity detection performance, the BERT model achieved a classification accuracy of 91.27% and an area under the receiver operating characteristic curve (AUC) score of 0.963, outperforming several baseline machine learning and neural network models. The analysis of user behavior showed that 16.11% of users publish toxic posts and 13.28% of users publish toxic comments. However, 30.68% of users publishing posts and 81.67% of users publishing comments exhibit changes in their toxicity across different communities, indicating that users adapt their behavior to the communities' norms. Furthermore, time series analysis of the volume of links and the toxicity of user content, using the Granger causality test, showed that toxic comments are Granger-caused by links in comments.
Keywords: Machine learning; Online communities; Online hate; Posting behavior; Reddit; Toxicity.
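To make the cross-community measurement in the abstract concrete, the following is a minimal sketch, not the authors' method. It assumes each post or comment already carries a binary toxicity label (for example, from a classifier), and it treats a user as "changing" when their share of toxic content differs between communities by more than a threshold. The function names, the `min_gap` parameter, and the sample records are all hypothetical; the abstract does not state how a toxicity change is defined.

```python
from collections import defaultdict


def toxic_share_by_community(records):
    """Return {user: {community: fraction of that user's items labeled toxic}}.

    `records` is an iterable of (user, community, is_toxic) tuples.
    This record layout is an assumption made for illustration.
    """
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # [toxic, total]
    for user, community, is_toxic in records:
        pair = counts[user][community]
        pair[0] += int(is_toxic)
        pair[1] += 1
    return {
        user: {c: toxic / total for c, (toxic, total) in comms.items()}
        for user, comms in counts.items()
    }


def users_with_cross_community_change(shares, min_gap=0.0):
    """Users active in 2+ communities whose toxic share varies by more than min_gap.

    `min_gap` is a hypothetical threshold, not a value taken from the paper.
    """
    changed = set()
    for user, comms in shares.items():
        if len(comms) >= 2 and max(comms.values()) - min(comms.values()) > min_gap:
            changed.add(user)
    return changed


# Made-up sample data, for illustration only.
records = [
    ("alice", "r/news", True), ("alice", "r/news", False), ("alice", "r/aww", False),
    ("bob", "r/news", False), ("bob", "r/aww", False),
    ("carol", "r/aww", True),
]
shares = toxic_share_by_community(records)
changed = users_with_cross_community_change(shares)
```

In this toy data only `alice` is flagged: her toxic share differs between the two communities, `bob` is equally non-toxic in both, and `carol` is active in a single community, so no cross-community comparison is possible for her.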
©2022 Almerekhi et al.
Conflict of interest statement
The authors declare there are no competing interests.
Similar articles
- Predicting Age Groups of Reddit Users Based on Posting Behavior and Metadata: Classification Model Development and Validation. JMIR Public Health Surveill. 2021 Mar 16;7(3):e25807. doi: 10.2196/25807. PMID: 33724195 Free PMC article.
- Endometriosis Online Communities: How Machine Learning Can Help Physicians Understand What Patients Are Discussing Online. J Minim Invasive Gynecol. 2024 Dec;31(12):1011-1018.e3. doi: 10.1016/j.jmig.2024.08.001. Epub 2024 Aug 10. PMID: 39134239
- Understanding Mental Health Issues in Different Subdomains of Social Networking Services: Computational Analysis of Text-Based Reddit Posts. J Med Internet Res. 2023 Nov 30;25:e49074. doi: 10.2196/49074. PMID: 38032730 Free PMC article.
- Characterizing Social Media Messages Related to Underage JUUL E-Cigarette Buying and Selling: Cross-Sectional Analysis of Reddit Subreddits. J Med Internet Res. 2020 Jul 20;22(7):e16962. doi: 10.2196/16962. PMID: 32706661 Free PMC article.
- Factors Associated With Weight Change in Online Weight Management Communities: A Case Study in the LoseIt Reddit Community. J Med Internet Res. 2017 Jan 16;19(1):e17. doi: 10.2196/jmir.5816. PMID: 28093378 Free PMC article.
Cited by
- Special issue on analysis and mining of social media data. PeerJ Comput Sci. 2024 Feb 29;10:e1909. doi: 10.7717/peerj-cs.1909. eCollection 2024. PMID: 38435569 Free PMC article.
- Bibliometric Analysis of Granger Causality Studies. Entropy (Basel). 2023 Apr 7;25(4):632. doi: 10.3390/e25040632. PMID: 37190420 Free PMC article.
- Tracking patterns in toxicity and antisocial behavior over user lifetimes on large social media platforms. Sci Rep. 2025 Jul 14;15(1):25369. doi: 10.1038/s41598-025-07086-3. PMID: 40659657 Free PMC article.