Journal article in Computational Linguistics, 2015

Large Linguistic Corpus Reduction with SCP algorithms

Abstract

For practical reasons (mainly related to costs), and to meet the quality expectations of the target application, corpus design is a crucial issue when building richly annotated text corpora. Reducing a large corpus while maintaining sufficient linguistic richness can be formalized as a Set Covering Problem (SCP). In this context, we present two algorithmic heuristics applied to the design of large text corpora in English and French covering multi-represented phonological units. The first algorithm is a standard greedy solution with an agglomerative/splitting strategy. We propose a second algorithm based on Lagrangian relaxation. This approach provides a lower bound on the cost of each covering solution. This lower bound can be used as a metric to evaluate the quality of a reduced corpus, whatever algorithm is applied. Experiments show that a suboptimal algorithm such as the greedy heuristic achieves good results: the cost of its solutions is not far from the lower bound (about 4.35% for the triphoneme coverings). Whereas constraints in SCP are usually binary, we propose here a generalization in which the constraint on each covering feature can be multi-valued.
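The paper's exact agglomerative/splitting and Lagrangian relaxation procedures are not reproduced here, but the greedy covering idea described in the abstract can be sketched as follows. The snippet below is a minimal illustration, assuming each sentence is represented by a cost and a multiset of phonological features (for instance triphonemes), and that each feature carries a multi-valued requirement (a minimum number of occurrences in the reduced corpus). The function name greedy_scp and the data layout are hypothetical and not taken from the paper.

```python
from collections import Counter

def greedy_scp(sentences, required):
    """Greedy heuristic for a multi-valued Set Covering Problem (sketch).

    sentences: list of (cost, Counter of feature occurrences); features could
               be triphonemes and the cost, e.g., the sentence length.
    required:  dict mapping each feature to the minimum number of occurrences
               the reduced corpus must contain (multi-valued constraint).
    Costs are assumed to be strictly positive.
    Returns the indices of the selected sentences.
    """
    remaining = Counter(required)          # occurrences still needed per feature
    selected = []
    candidates = set(range(len(sentences)))

    while any(v > 0 for v in remaining.values()):
        best, best_score = None, 0.0
        for i in candidates:
            cost, feats = sentences[i]
            # Gain = number of still-needed occurrences this sentence provides.
            gain = sum(min(c, remaining[f])
                       for f, c in feats.items() if remaining[f] > 0)
            score = gain / cost
            if score > best_score:
                best, best_score = i, score
        if best is None:                   # no candidate improves the covering
            break
        selected.append(best)
        candidates.remove(best)
        _, feats = sentences[best]
        for f, c in feats.items():
            if remaining[f] > 0:
                remaining[f] = max(0, remaining[f] - c)
    return selected

# Toy usage: require each "triphoneme" at least twice.
corpus = [
    (3, Counter({"a-b-a": 2, "b-a-b": 1})),
    (2, Counter({"b-a-b": 2})),
    (4, Counter({"a-b-a": 1, "c-a-c": 2})),
]
print(greedy_scp(corpus, {"a-b-a": 2, "b-a-b": 2, "c-a-c": 1}))
```

At each step the heuristic picks the sentence with the best ratio of still-needed feature occurrences to cost, which mirrors the standard greedy strategy for SCP; the Lagrangian relaxation discussed in the abstract would additionally yield a lower bound against which the cost of such a greedy solution can be compared.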
No file deposited

Dates and versions

hal-01135089, version 1 (24-03-2015)

Identifiers

Cite

Nelly Barbot, Olivier Boëffard, Jonathan Chevelu, Arnaud Delhay. Large Linguistic Corpus Reduction with SCP algorithms. Computational Linguistics, 2015, 41 (3), pp.30. ⟨10.1162/COLI_a_00225⟩. ⟨hal-01135089⟩