PeerJ Comput Sci. 2022 Aug 18:8:e1059.
doi: 10.7717/peerj-cs.1059. eCollection 2022.

Investigating toxicity changes of cross-community redditors from 2 billion posts and comments

Hind Almerekhi et al. PeerJ Comput Sci.

Abstract

This research investigates changes in online behavior of users who publish in multiple communities on Reddit by measuring their toxicity at two levels. With the aid of crowdsourcing, we built a labeled dataset of 10,083 Reddit comments, then used the dataset to train and fine-tune a Bidirectional Encoder Representations from Transformers (BERT) neural network model. The model predicted the toxicity levels of 87,376,912 posts from 577,835 users and 2,205,581,786 comments from 890,913 users on Reddit over 16 years, from 2005 to 2020. This study utilized the toxicity levels of user content to identify toxicity changes by the user within the same community, across multiple communities, and over time. As for the toxicity detection performance, the BERT model achieved a 91.27% classification accuracy and an area under the receiver operating characteristic curve (AUC) score of 0.963 and outperformed several baseline machine learning and neural network models. The user behavior toxicity analysis showed that 16.11% of users publish toxic posts, and 13.28% of users publish toxic comments. However, results showed that 30.68% of users publishing posts and 81.67% of users publishing comments exhibit changes in their toxicity across different communities, indicating that users adapt their behavior to the communities' norms. Furthermore, time series analysis with the Granger causality test of the volume of links and toxicity in user content showed that toxic comments are Granger caused by links in comments.
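The cross-community comparison described in the abstract can be illustrated with a minimal sketch. It assumes each item of user content already carries a binary toxicity label, treats a user's activity in a community as toxic if any of their items there is toxic, and counts a user as "changing" when their community-level labels differ. These rules, the function name, and the record layout are assumptions for illustration; the abstract does not state the authors' exact definition.

```python
from collections import defaultdict


def share_of_users_with_toxicity_change(records):
    """Return the fraction of multi-community users whose toxicity differs by community.

    records: iterable of (user, subreddit, is_toxic) tuples, one per post or comment.
    The aggregation rule (any toxic item marks the community as toxic for that user)
    is an assumption, not the paper's stated method.
    """
    labels_by_user = defaultdict(dict)
    for user, subreddit, is_toxic in records:
        previous = labels_by_user[user].get(subreddit, False)
        labels_by_user[user][subreddit] = previous or bool(is_toxic)

    # Only users active in at least two communities can show a cross-community change.
    multi_community = [labels for labels in labels_by_user.values() if len(labels) > 1]
    if not multi_community:
        return 0.0

    changed = sum(1 for labels in multi_community if len(set(labels.values())) > 1)
    return changed / len(multi_community)


example_records = [
    ("user_a", "r/x", True),
    ("user_a", "r/y", False),
    ("user_b", "r/x", False),
    ("user_b", "r/y", False),
    ("user_c", "r/x", True),  # single-community user, excluded from the ratio
]
share = share_of_users_with_toxicity_change(example_records)
```

In this toy input, only `user_a` and `user_b` are active in more than one community, and only `user_a` has labels that differ between communities.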

Keywords: Machine learning; Online communities; Online hate; Posting behavior; Reddit; Toxicity.

Conflict of interest statement

The authors declare there are no competing interests.

Figures

Figure 1. A Reddit post from the subreddit “r/science” with its associated discussion threads.
Figure 2. Cumulative distribution function of the participating subreddits per user in (A) posts and (B) comments.
Figure 3. The count of toxicity changes over time in posting users from condition 1 (NT → T), condition 2 (T → NT), and both conditions combined.
Figure 4. The count of toxicity changes over time in commenting users from condition 1 (NT → T), condition 2 (T → NT), and both conditions combined.
Figure 5. (A–D) Heatmap plots of the Δ in user posts and comments over two pairs of years. The dark color in the heatmap plot denotes scattered deltas, while the light colors denote deltas concentrated in specific locations.
Figure 6. The total amount of Δ in posting users' content over time with an interpolation of Δ averages across three-year intervals.
Figure 7. The total amount of Δ in commenting users' content over time with an interpolation of Δ averages across three-year intervals.
Figure 8. The total number of posts, toxic posts, and links in every year, followed by the normalized totals using the min-max scale.
Figure 9. The distribution of internal and external links, followed by the total number of known media types and image links from the posts collection.
Figure 10. The total number of comments, toxic comments, and links in every year, followed by the normalized totals using the min-max scale.
Figure 11. The distribution of internal and external links, followed by the total number of known media types and image links from the comments collection.
Figure 12. Correlation between the total number of participating subreddits over time and (A) the total number of toxic posts and (B) the total number of toxic comments.
Figure B1. The labeling task instructions that we provided to crowd workers.
Figure B2. The validation questions that crowd workers had to pass before beginning labeling.
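
The captions for Figures 8 and 10 mention totals normalized with the min-max scale. As a minimal sketch of that rescaling, the function below maps a series to the range [0, 1]. The function name and the sample numbers are hypothetical and are not taken from the paper.

```python
def min_max_scale(values):
    """Rescale a sequence of numbers linearly so the minimum maps to 0 and the maximum to 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant series has no spread; return zeros to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(value - lowest) / (highest - lowest) for value in values]


# Hypothetical yearly totals, used only to show the call.
yearly_totals = [10, 20, 30]
scaled = min_max_scale(yearly_totals)
```

Scaling each series separately puts counts of very different magnitude, such as all posts versus toxic posts, on a common axis so their trends can be compared.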
