Cleaning noisy wordnets

Abstract: Automatic approaches to creating and extending wordnets, which have become very popular in the past decade, inadvertently produce noisy synsets. We therefore propose an approach that detects synset outliers in order to eliminate this noise and improve the accuracy of the resulting wordnets, making them more useful lexico-semantic resources for natural language applications. The approach compares the words that appear in a synset and its surroundings with the contexts in which the literals in question are used in large monolingual corpora. By tuning the outlier threshold, we can control how many outlier candidates are eliminated. Although the proposed approach is language-independent, we test it on the Slovene and French wordnets, which were created automatically from bilingual resources and contain many disambiguation errors. Manual evaluation of the results shows that, with a threshold similar to the estimated error rate of the respective wordnet, 67% of the proposed outlier candidates are indeed incorrect for French and 64% for Slovene. This is a large improvement over the estimated overall error rates in the resources, which are 12% for French and 15% for Slovene.
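
The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name a similarity measure, so bag-of-words cosine similarity is assumed here, and the threshold is assumed to be the fraction of lowest-scoring (synset, literal) pairs to flag. All function names, synset identifiers, and toy word lists are hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors (assumed measure)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_outlier_candidates(pairs, neighbourhood, contexts):
    """Score each (synset_id, literal) pair and sort least similar first.

    neighbourhood: synset_id -> words from the synset and its surroundings
    contexts:      literal   -> words co-occurring with it in a corpus
    """
    scored = [
        (cosine(Counter(neighbourhood[s]), Counter(contexts[lit])), s, lit)
        for s, lit in pairs
    ]
    return sorted(scored)


def select_outliers(pairs, neighbourhood, contexts, threshold):
    """Flag the lowest-scoring fraction `threshold` of pairs (assumption)."""
    ranked = rank_outlier_candidates(pairs, neighbourhood, contexts)
    cutoff = int(len(ranked) * threshold)
    return [(s, lit) for _, s, lit in ranked[:cutoff]]


# Hypothetical toy data: "rive" (river bank) wrongly attached to a
# financial-institution synset.
neighbourhood = {
    "dog.n.01": ["animal", "pet", "bark", "canine"],
    "bank.n.01": ["money", "finance", "loan", "account"],
}
contexts = {
    "chien": ["animal", "pet", "bark", "leash"],
    "banque": ["money", "loan", "account", "credit"],
    "rive": ["river", "water", "shore", "boat"],
}
pairs = [("dog.n.01", "chien"), ("bank.n.01", "banque"), ("bank.n.01", "rive")]

outliers = select_outliers(pairs, neighbourhood, contexts, threshold=0.34)
print(outliers)
```

Raising the threshold flags more candidates at the cost of more false positives, which is the trade-off the abstract refers to when it speaks of tuning the outlier threshold.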
Document type:
Conference paper
LREC 2012 - Eighth International Conference on Language Resources and Evaluation, May 2012, Istanbul, Turkey. 2012

https://hal.inria.fr/hal-00703125
Contributor: Benoît Sagot
Submitted on: Thursday, 31 May 2012 - 21:35:53
Last modified on: Saturday, 9 June 2018 - 10:30:06
Document(s) archived on: Thursday, 15 December 2016 - 09:20:17

File

LREC2012-sagot_fiser-published...
Files produced by the author(s)

Identifiers

  • HAL Id: hal-00703125, version 1

Citation

Benoît Sagot, Darja Fišer. Cleaning noisy wordnets. LREC 2012 - Eighth International Conference on Language Resources and Evaluation, May 2012, Istanbul, Turkey. 2012. 〈hal-00703125〉

