Cleaning noisy wordnets

Abstract : Automatic approaches to creating and extending wordnets, which have become very popular in the past decade, inadvertently produce noisy synsets. We therefore propose an approach that detects synset outliers in order to eliminate the noise and improve the accuracy of the resulting wordnets, making them more useful lexico-semantic resources for natural language applications. The approach compares the words that appear in a synset and its surroundings with the contexts in which the literals in question are used in large monolingual corpora. Fine-tuning the outlier threshold controls how many outlier candidates are eliminated. Although the proposed approach is language-independent, we test it on Slovene and French wordnets that were created automatically from bilingual resources and contain plenty of disambiguation errors. Manual evaluation of the results shows that, when applying a threshold similar to the estimated error rate in the respective wordnets, 67% of the proposed outlier candidates are indeed incorrect for French and 64% for Slovene. This is a big improvement compared to the estimated overall error rates in the resources, which are 12% for French and 15% for Slovene.
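
The comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the similarity measure, so cosine similarity over bags of words is an assumption, and the function names, input format, and toy French data are all hypothetical. The sketch scores each (literal, synset) pair by how well the words around the synset match the literal's corpus contexts, then flags the lowest-scoring fraction of pairs as outlier candidates.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors (assumed measure)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def outlier_candidates(pairs, corpus_contexts, fraction):
    """Flag the lowest-scoring `fraction` of (literal, synset_id) pairs.

    pairs: maps (literal, synset_id) to the words found in the synset and
           its neighbouring synsets (hypothetical input format).
    corpus_contexts: maps a literal to the words seen around it in a
           monolingual corpus (hypothetical input format).
    """
    scores = {}
    for (literal, synset_id), neighbourhood in pairs.items():
        context = Counter(corpus_contexts.get(literal, []))
        scores[(literal, synset_id)] = cosine(Counter(neighbourhood), context)
    ranked = sorted(scores, key=scores.get)   # least similar pairs first
    cutoff = int(len(ranked) * fraction)      # threshold as a share of all pairs
    return ranked[:cutoff], scores


# Toy data: French "banque" wrongly attached to a "river bank" synset,
# the kind of disambiguation error a bilingual resource can introduce.
pairs = {
    ("banque", "financial_institution"): ["argent", "crédit", "institution"],
    ("banque", "river_bank"): ["rivière", "eau", "rive"],
}
contexts = {"banque": ["argent", "compte", "crédit", "client"]}

flagged, scores = outlier_candidates(pairs, contexts, fraction=0.5)
print(flagged)
```

In this toy setting the "river bank" pair shares no words with the corpus contexts of "banque", so it is the one flagged. Raising or lowering `fraction` plays the role of the outlier threshold mentioned in the abstract: a larger value eliminates more candidates at the cost of flagging more correct pairs.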
Document type :
Conference papers

https://hal.inria.fr/hal-00703125
Contributor : Benoît Sagot
Submitted on : Thursday, May 31, 2012 - 9:35:53 PM
Last modification on : Thursday, August 29, 2019 - 2:24:09 PM
Long-term archiving on : Thursday, December 15, 2016 - 9:20:17 AM

File

LREC2012-sagot_fiser-published...
Files produced by the author(s)

Identifiers

  • HAL Id : hal-00703125, version 1

Citation

Benoît Sagot, Darja Fišer. Cleaning noisy wordnets. LREC 2012 - Eighth International Conference on Language Resources and Evaluation, May 2012, Istanbul, Turkey. ⟨hal-00703125⟩
