Journal Articles Computational Linguistics Year : 2015

Large Linguistic Corpus Reduction with SCP algorithms

Nelly Barbot, Olivier Boëffard, Jonathan Chevelu, Arnaud Delhay


For practical reasons (mainly related to cost), and to meet the quality expectations of the target application, corpus design is a crucial issue when building richly annotated text corpora. Reducing a large corpus while maintaining sufficient linguistic richness can be formalized as a Set Covering Problem (SCP). In this context, we present two heuristic algorithms applied to the design of large text corpora in English and French that cover multi-represented phonological units. The first is a standard greedy algorithm with an agglomerative/splitting strategy. The second is based on Lagrangian relaxation. This approach provides a lower bound on the cost of any covering solution, which can be used as a metric to evaluate the quality of a reduced corpus regardless of the algorithm that produced it. Experiments show that a suboptimal algorithm such as the greedy one achieves good results: the cost of its solutions is close to the lower bound (within about 4.35% for the triphoneme coverings). SCP constraints are usually binary; here we propose a generalization in which the constraint on each covering feature can be multi-valued.
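To make the formulation concrete, here is a minimal Python sketch of the two ideas the abstract names: a greedy heuristic for a multi-valued set cover, and a Lagrangian lower bound on the cost of any cover. It is not the authors' implementation. The function names, the data layout (one unit-to-count dictionary per sentence), the benefit/cost selection rule, and the subgradient update with a Polyak-style step are all assumptions made for illustration, and the agglomerative/splitting strategy mentioned above is not reproduced.

```python
def greedy_multicover(sentences, costs, demand):
    """Pick sentence indices until every unit reaches its required count.

    sentences: list of dicts, unit -> occurrences in that sentence (hypothetical layout)
    costs:     list of positive costs, e.g. sentence lengths
    demand:    dict, unit -> minimum number of occurrences to keep
    """
    residual = {u: d for u, d in demand.items() if d > 0}
    remaining = list(range(len(sentences)))
    chosen = []

    def gain(j):
        # Only occurrences that still reduce an unmet demand count as useful.
        return sum(min(sentences[j].get(u, 0), r) for u, r in residual.items())

    while residual:
        best = max(remaining, key=lambda j: gain(j) / costs[j], default=None)
        if best is None or gain(best) == 0:
            raise ValueError("the demand cannot be met by the given sentences")
        chosen.append(best)
        remaining.remove(best)
        for u, n in sentences[best].items():
            if u in residual:
                residual[u] -= n
                if residual[u] <= 0:
                    del residual[u]
    return chosen


def lagrangian_lower_bound(sentences, costs, demand, upper, iters=200, theta=1.0):
    """Lower bound on the cost of any feasible cover.

    For multipliers u >= 0, the relaxed value
        L(u) = sum_i u_i * b_i + sum_j min(0, c_j - sum_i u_i * a_ij)
    never exceeds the cost of a feasible cover, so the best L(u) seen is a valid bound.
    `upper` is the cost of a known cover (e.g. the greedy one), used to size the step.
    """
    u = {k: 0.0 for k in demand}
    best = 0.0  # valid because all costs are positive
    for _ in range(iters):
        reduced = [c - sum(u[k] * n for k, n in s.items() if k in u)
                   for s, c in zip(sentences, costs)]
        value = sum(u[k] * demand[k] for k in u) + sum(r for r in reduced if r < 0)
        best = max(best, value)
        # Subgradient: unmet demand under the relaxed solution (reduced cost < 0).
        g = {k: demand[k] - sum(s.get(k, 0)
                                for s, r in zip(sentences, reduced) if r < 0)
             for k in u}
        norm = sum(v * v for v in g.values())
        if norm == 0:
            break
        step = theta * (upper - value) / norm
        u = {k: max(0.0, u[k] + step * g[k]) for k in u}
    return best
```

A typical use would run `greedy_multicover` first, sum the costs of the chosen sentences, and pass that total as `upper` to `lagrangian_lower_bound`; the relative gap between the two numbers then plays the role of the quality metric described in the abstract.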

Dates and versions

hal-01135089, version 1 (24-03-2015)



Nelly Barbot, Olivier Boëffard, Jonathan Chevelu, Arnaud Delhay. Large Linguistic Corpus Reduction with SCP algorithms. Computational Linguistics, 2015, 41 (3), pp.30. ⟨10.1162/COLI_a_00225⟩. ⟨hal-01135089⟩


