Conference papers

Exploring Conditional Language Model Based Data Augmentation Approaches For Hate Speech Classification

Abstract: Deep Neural Network (DNN) based classifiers have gained increased attention in hate speech classification. However, the performance of DNN classifiers improves with the quantity of available training data, and in practice hate speech datasets contain only a small amount of labeled data. To counter this, Data Augmentation (DA) techniques are often used to increase the number of labeled samples and thereby improve the classifier's performance. In this article, we explore augmentation of training samples using a conditional language model. Our approach uses a single class-conditioned Generative Pre-Trained Transformer-2 (GPT-2) language model for DA, avoiding the need for multiple class-specific GPT-2 models. We study the effect of increasing the quantity of augmented data and show that adding a few hundred samples significantly improves the classifier's performance. Furthermore, we evaluate the effect of filtering the generated data used for DA. Our approach demonstrates relative improvements in macro-averaged F1 of up to 7.3% and up to 25.0% on two widely used hate speech corpora.
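As an illustration of the metric quoted in the abstract, the following is a minimal sketch of how macro-averaged F1 and a relative improvement over a baseline could be computed. It is not the authors' evaluation code: the function names `macro_f1` and `relative_improvement`, the label strings, and the toy predictions are all assumptions made for this example.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro-averaged F1)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def relative_improvement(baseline, augmented):
    """Relative gain of `augmented` over `baseline`, in percent."""
    return 100.0 * (augmented - baseline) / baseline


# Hypothetical toy labels, not data from the paper.
y_true = ["hate", "hate", "normal", "normal"]
y_pred = ["hate", "normal", "normal", "normal"]
score = macro_f1(y_true, y_pred)
```

Because every class contributes equally to the macro average, the minority class (often the hateful one) weighs as much as the majority class, which is why this metric is commonly reported on imbalanced corpora.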
https://hal.inria.fr/hal-03244472
Contributor: Ashwin Geet d'Sa
Submitted on: Tuesday, June 1, 2021 - 11:33:13 AM
Last modification on: Thursday, May 5, 2022 - 10:21:31 AM
Long-term archiving on: Thursday, September 2, 2021 - 6:37:56 PM

File

Article_on_DA_for_HAL.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: hal-03244472, version 1

Citation

Ashwin Geet d'Sa, Irina Illina, Dominique Fohr, Dietrich Klakow, Dana Ruiter. Exploring Conditional Language Model Based Data Augmentation Approaches For Hate Speech Classification. TSD 2021 - 24th International Conference on Text, Speech and Dialogue, Sep 2021, Olomouc, Czech Republic. ⟨hal-03244472⟩
