Playing the Part of the Sharp Bully: Generating Adversarial Examples for Implicit Hate Speech Detection

Research on abusive content detection on social media has primarily focused on explicit forms of hate speech (HS), which are often identifiable by recognizing hateful words and expressions. Messages containing linguistically subtle and implicit forms of HS remain an open challenge for automatic HS detection. In this paper, we propose a new framework for generating adversarial implicit HS short-text messages using auto-regressive language models. Moreover, we propose a strategy to group the generated implicit messages into complexity levels (EASY, MEDIUM, and HARD) that characterize how challenging these messages are for supervised classifiers. Finally, building on Dinan et al. (2019) and Vidgen et al. (2021), we propose a “build it, break it, fix it” training scheme using HARD messages, and show that iteratively retraining on them substantially improves the performance of state-of-the-art (SOTA) models on implicit HS benchmarks.
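The grouping and retraining ideas in the abstract can be illustrated with a minimal sketch. The abstract does not state how the EASY, MEDIUM, and HARD levels are assigned, so this sketch assumes they are derived from a classifier's predicted probability that a generated message is hateful. The thresholds `easy_min` and `hard_max`, the function names, and the `generate` and `fit` callables are all hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Example = Tuple[str, int]          # (message text, label), where 1 = hateful
Scorer = Callable[[str], float]    # returns an assumed P(hateful | message)


def bucket_by_difficulty(
    messages: Sequence[str],
    hate_probs: Sequence[float],
    easy_min: float = 0.7,   # assumed threshold, not from the paper
    hard_max: float = 0.3,   # assumed threshold, not from the paper
) -> Dict[str, List[str]]:
    """Group generated implicit-HS messages by how confidently a classifier flags them.

    Every message is assumed to be truly hateful, so a low predicted
    probability means the classifier was fooled (HARD).
    """
    if len(messages) != len(hate_probs):
        raise ValueError("messages and hate_probs must have the same length")
    buckets: Dict[str, List[str]] = {"EASY": [], "MEDIUM": [], "HARD": []}
    for text, prob in zip(messages, hate_probs):
        if prob >= easy_min:
            buckets["EASY"].append(text)
        elif prob <= hard_max:
            buckets["HARD"].append(text)
        else:
            buckets["MEDIUM"].append(text)
    return buckets


def build_break_fix(
    train_set: List[Example],
    generate: Callable[[], List[str]],         # hypothetical adversarial generator
    fit: Callable[[List[Example]], Scorer],    # hypothetical training routine
    rounds: int = 3,
) -> Tuple[Scorer, List[Example]]:
    """Retrain on HARD messages for up to `rounds` iterations."""
    scorer = fit(train_set)                                   # build it
    for _ in range(rounds):
        candidates = generate()                               # break it
        probs = [scorer(message) for message in candidates]
        hard = bucket_by_difficulty(candidates, probs)["HARD"]
        if not hard:
            break                                             # nothing fools the model
        train_set = train_set + [(message, 1) for message in hard]
        scorer = fit(train_set)                               # fix it
    return scorer, train_set
```

In a realistic setting, `generate` would wrap an auto-regressive language model and `fit` would fine-tune a supervised HS classifier. Both are left as plain callables here so that the control flow stays visible.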

Keywords

Natural Language Processing; Hate Speech

Domains

Artificial Intelligence [cs.AI]

Contributor: Nicolás Benjamín Ocampo

https://hal.science/hal-04214125

Submitted on: Thursday, September 21, 2023, 16:37:57

Last modified on: Monday, February 26, 2024, 11:22:07

Dates and versions

hal-04214125, version 1 (21-09-2023)

Identifiers

HAL Id: hal-04214125, version 1
DOI: 10.18653/v1/2023.findings-acl.173

Cite

Nicolás Benjamín Ocampo, Elena Cabrio, Serena Villata. Playing the Part of the Sharp Bully: Generating Adversarial Examples for Implicit Hate Speech Detection. ACL 2023 - 61st Annual Meeting of the Association for Computational Linguistics, Jul 2023, Toronto, Canada. pp. 2758-2772, ⟨10.18653/v1/2023.findings-acl.173⟩. ⟨hal-04214125⟩


Collections

CNRS INRIA I3S WIMMICS INRIA2 UNIV-COTEDAZUR 3IA-COTEDAZUR ANR
