Reducing Hate Speech on Social Media

A randomized experiment comparing different interventions on the Twitter network of Nigerian users. In collaboration with the World Bank.

The age of the internet promised an era of connectivity, diversity and inclusion. The internet enabled many new technologies such as social media, which, for the most part, allowed unprecedented levels of communication between communities and, more importantly, provided a voice for marginalized groups. However, social media has had its share of unintended consequences. Polarization has been on the rise across many societies, and this rise has been attributed to social media platforms [1,2]. Polarization is a major concern because social media provides a platform not just to marginalized communities but also to divisive language and hateful content. More importantly, these platforms create an incentive structure that rewards the most extreme voices and amplifies them in ways that were not possible before the advent of social media [3,4].

In the absence of a precise architecture for content moderation, interventions for addressing polarization on online social media might need to be community-driven rather than punitive [5,6]. Promising solutions involve raising the costs of polarizing language throughout a user community by changing the norms around such language [7], changing the incentives to produce divisive content, and promoting influential users who advocate for moderation and inclusivity. In this project, we focus on the incentives around hate speech as one manifestation of polarization on social media. Understanding ways to reduce the level of hate speech on social media is important, since recent studies have argued for a potential link between outbursts of violence, social enmity and negative content on social media [8].

The goal of this study is to evaluate and compare the effectiveness of several interventions in reducing the frequency of hate speech among Twitter users in Nigeria. We focus on Nigeria as it provides an opportunity to study the effects of separatism and ethnic tensions, since in many countries the polarizing effects of hate speech on Twitter occur along ethnic lines. The interventions are all community-driven in the sense that they do not involve active content moderation by the platform; rather, they attempt to change the norms and the incentive structure for disseminating tweets that contain hate speech. In particular, we attempt to reduce either the incentive to generate hateful content, by targeting users who regularly engage in such speech, or the incentive to consume hateful content, by targeting the followers of hate content producers. In effect, our interventions differ by their recipients: they attempt to change the norms around (1) the production of hate speech on the supply side, (2) its consumption on the demand side, or both, which allows us to compare the effectiveness of each intervention.

The current literature on the detrimental effects of social media mostly focuses on users who engage in problematic behavior, such as sharing misinformation or using polarizing and hateful language [9]. This approach assumes the problem can be addressed by changing the behavior of hate speech producers, but it overlooks the incentives behind such behavior. Producers of hate speech often produce it because there is demand for it among the community of their followers. Any approach that does not address this aspect of online hate speech might not have lasting effects. The design and findings of this research are novel in that we view the dissemination of polarizing content within a larger ecosystem of incentives facing both producers and consumers of such content. Understanding the effect of each component of this ecosystem on hate speech allows us to design interventions that potentially achieve long-lasting effects by changing community-level norms.

In addition to varying the recipient of the intervention (consumer or producer), we also study the effect of different intervention contents. Previous research has shown the effect of pro-social messages from popular figures such as celebrities or authority figures [10]. We plan to partner with Nigerian celebrities who are active on Twitter and who will post pro-social tweets, authored by us, that remind Nigerian Twitter users of the detrimental effects of hateful content and invite them to avoid such language. These pro-social tweets are then promoted as ads to the treated users, either the producers or the consumers of hate content, as explained above. As a point of comparison for the celebrity effect of such pro-social messages, we plan to promote the exact same tweet content as ads, but posted from the account of a co-ethnic user who is not a celebrity. In the Nigerian context, ethnicity can often be inferred from a first or last name; thus, by creating accounts with certain name patterns, we can expose the producers or consumers of hate content to users of a specific ethnicity. In this intervention, the exact same message, posted by ethnic in-group accounts that we control, is promoted to the treated users. The third and final intervention content is based on previous work on intergroup contact theory [11]. Contact theory holds that exposure to an out-group tends to promote tolerance and acceptance of that out-group. In the context of our experiment, we plan to create multiple accounts with names that suggest out-group ethnicity. These accounts will follow and actively engage with the treated users on Twitter by interacting with their tweets, retweeting or liking them over a period of time.

In summary, this research project aims to evaluate the effectiveness of several community-based interventions in reducing the amount of hate content among Nigerian users of Twitter. In particular, the interventions vary across two dimensions: (1) Who is the treated population? Do we achieve a better outcome by targeting the producers or the consumers of hate content, or both? (2) What is the content of the intervention? Are pro-social messages more effective when they are promoted by celebrities or by co-ethnic users? How effective are interventions based on contact theory on Twitter in reducing inter-ethnic hate content?
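
The two dimensions summarized here (recipient and content) can be illustrated with a short randomization sketch. This is only a hypothetical illustration, not the study's actual assignment procedure: it assumes the two dimensions are fully crossed, that a pure control arm exists, and that the unit of randomization is a cluster made up of one hate content producer and their followers. None of these details are stated in the text, and the names `build_arms` and `assign_clusters` are invented for this example.

```python
import itertools
import random
from collections import Counter

# Levels of the two dimensions named in the text. Crossing them fully and
# adding a control arm are ASSUMPTIONS made for illustration only.
RECIPIENTS = ["producers", "consumers", "both"]
CONTENTS = ["celebrity_message", "co_ethnic_message", "out_group_contact"]


def build_arms(include_control=True):
    """Return one label per recipient-by-content cell, plus an optional control arm."""
    arms = [
        f"{recipient}:{content}"
        for recipient, content in itertools.product(RECIPIENTS, CONTENTS)
    ]
    if include_control:
        arms.append("control")
    return arms


def assign_clusters(cluster_ids, arms, seed=0):
    """Shuffle the clusters, then deal them to arms in round-robin order.

    Arm sizes differ by at most one. A "cluster" (hypothetical unit) is one
    producer together with their followers, so that the "both" condition can
    be delivered to a whole producer-follower group at once.
    """
    rng = random.Random(seed)  # fixed seed keeps the assignment reproducible
    shuffled = list(cluster_ids)
    rng.shuffle(shuffled)
    return {cluster_id: arms[i % len(arms)] for i, cluster_id in enumerate(shuffled)}


if __name__ == "__main__":
    arms = build_arms()
    clusters = [f"cluster_{i:03d}" for i in range(100)]  # placeholder identifiers
    assignment = assign_clusters(clusters, arms, seed=42)
    for arm, size in sorted(Counter(assignment.values()).items()):
        print(arm, size)
```

Randomizing at the cluster level rather than the individual level is one way to keep followers of the same producer in the same condition; whether the study uses such a design is not specified here.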

Generating evidence on the effectiveness of these interventions can provide alternative strategies for reducing hate content on social media, as opposed to punitive measures such as banning users or deleting content. Such punitive measures might actually backfire [12], as they incentivize users to migrate to alternative platforms and do not address the root causes of such behavior in the network of users. The proposed interventions potentially have long-lasting effects as they attempt to change the norms around such problematic behavior in the users' network.


[1] Bail, Chris. Breaking the Social Media Prism. Princeton University Press, (2021).
[2] Iyengar, Shanto, and Sean J. Westwood. “Fear and loathing across party lines: New evidence on group polarization.” American Journal of Political Science 59, no. 3 (2015): 690-707.
[3] Allcott, Hunt, and Matthew Gentzkow. “Social media and fake news in the 2016 election.” Journal of Economic Perspectives 31, no. 2 (2017): 211-36.
[4] Barberá, Pablo, and Gonzalo Rivero. “Understanding the political representativeness of Twitter users.” Social Science Computer Review 33, no. 6 (2015): 712-729.
[5] Jiménez Durán, Rafael. “The Economics of Content Moderation: Theory and Experimental Evidence from Hate Speech on Twitter.” Available at SSRN (2022).
[6] Myers West, Sarah. “Censored, suspended, shadowbanned: User interpretations of content moderation on social media platforms.” New Media & Society 20, no. 11 (2018): 4366-4383.
[7] Freelon, Deen. “Discourse architecture, ideology, and democratic norms in online political discussion.” New Media & Society 17, no. 5 (2015): 772-791.
[8] Müller, Karsten, and Carlo Schwarz. “Fanning the flames of hate: Social media and hate crime.” Journal of the European Economic Association 19, no. 4 (2021): 2131-2167.
[9] Bail, Christopher A., Lisa P. Argyle, Taylor W. Brown, John P. Bumpus, Haohan Chen, MB Fallin Hunzaker, Jaemin Lee, Marcus Mann, Friedolin Merhout, and Alexander Volfovsky. “Exposure to opposing views on social media can increase political polarization.” Proceedings of the National Academy of Sciences 115, no. 37 (2018): 9216-9221.
[10] Banerjee, Abhijit, Arun G. Chandrasekhar, Suresh Dalpath, Esther Duflo, John Floretta, Matthew O. Jackson, Harini Kannan et al. Selecting the most effective nudge: Evidence from a large-scale experiment on immunization. No. w28726. National Bureau of Economic Research, (2021).
[11] Pettigrew, Thomas F. “Intergroup contact theory.” Annual Review of Psychology 49, no. 1 (1998): 65-85.
[12] Hobbs, William R., and Margaret E. Roberts. “How sudden censorship can increase access to information.” American Political Science Review 112, no. 3 (2018): 621-636.