From Noisy Questions to Minecraft Texts: Annotation Challenges in Extreme Syntax Scenarios

Abstract: User-generated content presents many challenges for automatic processing. While many of them stem from out-of-vocabulary effects, others arise from different linguistic phenomena such as unusual syntax. In this work we present a French three-domain data set made up of question headlines from a cooking forum, and game chat logs and associated forums from two popular online games (MINECRAFT & LEAGUE OF LEGENDS). We chose these domains because they encompass different degrees of lexical and syntactic compliance with canonical language. We conduct an automatic and manual evaluation of the difficulties of processing these domains for part-of-speech prediction, and introduce a pilot study to determine whether dependency analysis lends itself well to annotating these data. We also discuss the development cost of our data set.
Document type:
Conference paper
2nd Workshop on Noisy User-generated Text (W-NUT) at CoLing 2016, Dec 2016, Osaka, Japan. 2016

Cited literature [33 references]

https://hal.inria.fr/hal-01584054
Contributor: Benoît Sagot
Submitted on: Friday, 8 September 2017 - 11:35:41
Last modified on: Saturday, 21 October 2017 - 01:06:30

File

WNUT18.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: hal-01584054, version 1

Citation

Héctor Alonso Martínez, Djamé Seddah, Benoît Sagot. From Noisy Questions to Minecraft Texts: Annotation Challenges in Extreme Syntax Scenarios. 2nd Workshop on Noisy User-generated Text (W-NUT) at CoLing 2016, Dec 2016, Osaka, Japan. 2016. 〈hal-01584054〉

Metrics

Record views

60

File downloads

18