Large Linguistic Corpus Reduction with SCP algorithms

Abstract : For pratical reasons (mainly related to costs), and to meet the quality expectations of the associated application, corpus design is a crucial issue for building rich annotated text corpora. Reducing a large corpus while maintaining sufficient linguistic richness can be formalized as a Set Covering Problem (SCP). Within this context, we present in this paper two algorithmic heuristics applied to design large text corpora in English and French and covering multi-represented phonological units. The first considered algorithm is a standard greedy solution with an agglomerative/spitting strategy. We propose a second algorithm based on Lagrangian relaxation. This approach provides a lower bound concerning the cost of each covering solution. This lower bound can be used as a metric to evaluate the quality of a reduced corpus whatever the algorithm applied. Experiments show that a suboptimal algorithm like a Greedy achieves good results; the cost of its solutions is not so far from the lower bound (about 4.35% for the triphoneme coverings). Usually constraints on SCP are binary, we proposed here a generalization where the constraint on each covering feature can be multi-valued.
Type de document :
Article dans une revue
Computational Linguistics, Massachusetts Institute of Technology Press (MIT Press), 2015, 41 (3), pp.30. 〈http://www.mitpressjournals.org/doi/pdf/10.1162/COLI_a_00225〉. 〈10.1162/COLI a 00225〉
Liste complète des métadonnées

https://hal.inria.fr/hal-01135089
Contributeur : Expression Irisa <>
Soumis le : mardi 24 mars 2015 - 16:37:40
Dernière modification le : jeudi 5 avril 2018 - 12:30:23

Identifiants

Citation

Nelly Barbot, Olivier Boëffard, Jonathan Chevelu, Arnaud Delhay. Large Linguistic Corpus Reduction with SCP algorithms. Computational Linguistics, Massachusetts Institute of Technology Press (MIT Press), 2015, 41 (3), pp.30. 〈http://www.mitpressjournals.org/doi/pdf/10.1162/COLI_a_00225〉. 〈10.1162/COLI a 00225〉. 〈hal-01135089〉

Partager

Métriques

Consultations de la notice

345