BERT and fastText Embeddings for Automatic Detection of Toxic Speech

Ashwin Geet d'Sa; Irina Illina; Dominique Fohr

Conference paper, Year: 2020

Ashwin Geet d'Sa

Role: Author

Speech Modeling for Facilitating Oral-Based Communication

Irina Illina

Role: Author
PersonId: 15663
IdHAL: irina-illina
IdRef: 120731746

Speech Modeling for Facilitating Oral-Based Communication

Dominique Fohr

Role: Author
PersonId: 15652
IdHAL: dominique-fohr
IdRef: 031092942

Speech Modeling for Facilitating Oral-Based Communication

Abstract

As Internet usage has expanded and made it easier for individuals to share their thoughts and opinions, the spread of online hate speech has increased sharply. Social media, community forums, and discussion platforms are common venues for online discussion where people can communicate freely. However, some people misuse this freedom of speech by arguing aggressively, offending others, and spreading verbal violence. Because there is no clear distinction between the terms offensive, abusive, hate, and toxic speech, in this paper we refer to all of them as toxic speech. In many countries, online toxic speech is punishable by law. It is therefore important to automatically detect and remove toxic speech from online media. In this work, we propose automatic classification of toxic speech using word embedding representations and deep-learning techniques. We perform binary and multi-class classification on a Twitter corpus and study two approaches: (a) extracting word embeddings and then using a DNN classifier; (b) fine-tuning the pre-trained BERT model. We observed that BERT fine-tuning performed much better. The proposed methodology can be applied to any other type of social media comments.
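Approach (a) in the abstract, extracting embeddings and then training a separate classifier, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' setup: the three-dimensional `TOY_EMBEDDINGS` table stands in for pre-trained fastText or BERT vectors, mean pooling stands in for whatever sentence representation the paper uses, and a single logistic layer trained by gradient descent stands in for the DNN classifier.

```python
import math

# Hypothetical 3-dimensional embedding table (illustration only); a real
# system would load pre-trained fastText or BERT vectors instead.
TOY_EMBEDDINGS = {
    "you": [0.1, 0.0, 0.2],
    "are": [0.0, 0.1, 0.1],
    "great": [0.9, 0.1, 0.0],
    "kind": [0.8, 0.2, 0.1],
    "awful": [-0.9, 0.1, 0.0],
    "stupid": [-0.8, 0.0, 0.1],
}
DIM = 3


def embed(text):
    """Mean-pool the vectors of known tokens; unknown tokens are skipped."""
    vectors = [TOY_EMBEDDINGS[t] for t in text.lower().split() if t in TOY_EMBEDDINGS]
    if not vectors:
        return [0.0] * DIM
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def predict_proba(weights, bias, features):
    """Probability of the 'toxic' class under a logistic model."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, epochs=500, lr=1.0):
    """Fit the logistic layer with per-sample gradient descent on log loss."""
    weights, bias = [0.0] * DIM, 0.0
    for _ in range(epochs):
        for text, label in samples:
            x = embed(text)
            error = predict_proba(weights, bias, x) - label
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


# Tiny made-up training set: label 1 = toxic, label 0 = non-toxic.
TRAIN = [
    ("you are great", 0),
    ("you are kind", 0),
    ("you are awful", 1),
    ("you are stupid", 1),
]
weights, bias = train(TRAIN)
```

Calling `predict_proba(weights, bias, embed("you are stupid"))` then returns the model's toxicity probability for that sentence. The embeddings stay fixed while only the classifier is trained, which is the contrast with approach (b), where the pre-trained BERT weights themselves are updated during fine-tuning.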

Keywords

deep neural networks; word embeddings; hate speech detection; natural language processing

Domains

Artificial Intelligence [cs.AI]

Main file

IEEE_SIIE2020_v4.pdf (356.73 KB)

Origin: Files produced by the author(s)

https://inria.hal.science/hal-02448197

Submitted on: Wednesday, 1 April 2020, 09:49:16

Last modified on: Monday, 11 September 2023, 17:41:19

Dates and versions

hal-02448197, version 1 (22-01-2020)

hal-02448197, version 2 (01-04-2020)

Identifiers

HAL Id: hal-02448197, version 2

Cite

Ashwin Geet d'Sa, Irina Illina, Dominique Fohr. BERT and fastText Embeddings for Automatic Detection of Toxic Speech. SIIE 2020 - Information Systems and Economic Intelligence; International Multi-Conference on: “Organization of Knowledge and Advanced Technologies” (OCTA), Feb 2020, Tunis, Tunisia. ⟨hal-02448197v2⟩

Collections

CNRS INRIA UNIV-LORRAINE INRIA2 LORIA LORIA-NLPKD
