Exploring Conditional Language Model Based Data Augmentation Approaches For Hate Speech Classification

Ashwin Geet d'Sa; Irina Illina; Dominique Fohr; Dietrich Klakow; Dana Ruiter

Conference paper. Year: 2021

Ashwin Geet d'Sa

Role: Author

Laboratoire Lorrain de Recherche en Informatique et ses Applications

Speech Modeling for Facilitating Oral-Based Communication

Irina Illina

Role: Author
PersonId: 15663
IdHAL: irina-illina
IdRef: 120731746

Speech Modeling for Facilitating Oral-Based Communication

Dominique Fohr

Role: Author
PersonId: 15652
IdHAL: dominique-fohr
IdRef: 031092942

Speech Modeling for Facilitating Oral-Based Communication

Dietrich Klakow

Role: Author
PersonId: 1095147

Universität des Saarlandes [Saarbrücken]

Dana Ruiter

Role: Author
PersonId: 1100508

Universität des Saarlandes [Saarbrücken]

Abstract

Deep Neural Network (DNN) based classifiers have gained increased attention in hate speech classification. However, the performance of DNN classifiers increases with the quantity of available training data, and in practice hate speech datasets contain only a small amount of labeled data. To counter this, Data Augmentation (DA) techniques are often used to increase the number of labeled samples and thereby improve the classifier's performance. In this article, we explore the augmentation of training samples using a conditional language model. Our approach uses a single class-conditioned Generative Pre-Trained Transformer-2 (GPT-2) language model for DA, avoiding the need for multiple class-specific GPT-2 models. We study the effect of increasing the quantity of augmented data and show that adding a few hundred samples significantly improves the classifier's performance. Furthermore, we evaluate the effect of filtering the generated data used for DA. Our approach demonstrates relative improvements of up to 7.3% and up to 25.0% in macro-averaged F1 on two widely used hate speech corpora.
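
To make the two central ideas of the abstract concrete, the following is a minimal sketch, not the authors' implementation. It shows one plausible way to format training text so that a single language model can be conditioned on a class label, and how a relative improvement in macro-averaged F1 is computed. The label names (`hate`, `normal`), the `label: text` layout, and all function names are assumptions for illustration; the abstract does not specify the conditioning format. The only GPT-2 specific detail used is its end-of-text token, `<|endoftext|>`.

```python
# Assumption: the class label is prepended as plain text, followed by ": ".
# The abstract does not state the exact conditioning format used in the paper.
END_OF_TEXT = "<|endoftext|>"  # GPT-2's end-of-text token


def format_conditioned_sample(label: str, text: str) -> str:
    """Build one fine-tuning line: class label, separator, sample text, end marker."""
    return f"{label}: {text} {END_OF_TEXT}"


def generation_prompt(label: str) -> str:
    """Prompt asking the single fine-tuned model for a new sample of class `label`."""
    return f"{label}:"


def macro_f1(y_true, y_pred) -> float:
    """Unweighted mean of per-class F1 over all classes seen in y_true or y_pred."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def relative_improvement(baseline: float, augmented: float) -> float:
    """Relative improvement of `augmented` over `baseline`, in percent."""
    return 100.0 * (augmented - baseline) / baseline
```

Because every training line carries its label, one fine-tuned model can be prompted with `generation_prompt("hate")` or `generation_prompt("normal")` instead of maintaining one model per class. The reported gains are relative: with hypothetical scores, a macro-averaged F1 rising from 0.60 to 0.75 would be a 25% relative improvement, even though the absolute gain is 0.15.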

Keywords

Natural language processing; Hate speech classification; Data augmentation

Domains

Computer Science [cs]; Computation and Language [cs.CL]

Main file

Article_on_DA_for_HAL.pdf (305.8 KB)

Origin: Files produced by the author(s)

https://inria.hal.science/hal-03244472

Submitted on: Tuesday, June 1, 2021, 11:33:13

Last modified on: Monday, September 11, 2023, 17:41:19

Long-term archiving on: Thursday, September 2, 2021, 18:37:56

Dates and versions

hal-03244472, version 1 (01-06-2021)

Identifiers

HAL Id: hal-03244472, version 1

Cite

Ashwin Geet d'Sa, Irina Illina, Dominique Fohr, Dietrich Klakow, Dana Ruiter. Exploring Conditional Language Model Based Data Augmentation Approaches For Hate Speech Classification. TSD 2021 - 24th International Conference on Text, Speech and Dialogue, Sep 2021, Olomouc, Czech Republic. ⟨hal-03244472⟩

Collections

CNRS, INRIA, GRID5000, UNIV-LORRAINE, INRIA2, LORIA, LORIA-NLPKD, SILECS
