Skip to Main content Skip to Navigation
Journal articles

Large Linguistic Corpus Reduction with SCP algorithms

Abstract : For pratical reasons (mainly related to costs), and to meet the quality expectations of the associated application, corpus design is a crucial issue for building rich annotated text corpora. Reducing a large corpus while maintaining sufficient linguistic richness can be formalized as a Set Covering Problem (SCP). Within this context, we present in this paper two algorithmic heuristics applied to design large text corpora in English and French and covering multi-represented phonological units. The first considered algorithm is a standard greedy solution with an agglomerative/spitting strategy. We propose a second algorithm based on Lagrangian relaxation. This approach provides a lower bound concerning the cost of each covering solution. This lower bound can be used as a metric to evaluate the quality of a reduced corpus whatever the algorithm applied. Experiments show that a suboptimal algorithm like a Greedy achieves good results; the cost of its solutions is not so far from the lower bound (about 4.35% for the triphoneme coverings). Usually constraints on SCP are binary, we proposed here a generalization where the constraint on each covering feature can be multi-valued.
Document type :
Journal articles
Complete list of metadata
Contributor : Expression Irisa Connect in order to contact the contributor
Submitted on : Tuesday, March 24, 2015 - 4:37:40 PM
Last modification on : Tuesday, October 19, 2021 - 11:58:58 PM



Nelly Barbot, Olivier Boëffard, Jonathan Chevelu, Arnaud Delhay. Large Linguistic Corpus Reduction with SCP algorithms. Computational Linguistics, Massachusetts Institute of Technology Press (MIT Press), 2015, 41 (3), pp.30. ⟨10.1162/COLIa00225⟩. ⟨hal-01135089⟩



Les métriques sont temporairement indisponibles