Large Linguistic Corpus Reduction with SCP algorithms

Abstract: For practical reasons (mainly cost), and to meet the quality expectations of the target application, corpus design is a crucial issue when building rich annotated text corpora. Reducing a large corpus while maintaining sufficient linguistic richness can be formalized as a Set Covering Problem (SCP). In this context, we present two algorithmic heuristics applied to the design of large text corpora in English and French covering multi-represented phonological units. The first algorithm is a standard greedy solution with an agglomerative/splitting strategy. We propose a second algorithm based on Lagrangian relaxation, which provides a lower bound on the cost of each covering solution. This lower bound can be used as a metric to evaluate the quality of a reduced corpus, whatever algorithm is applied. Experiments show that a suboptimal algorithm such as the greedy one achieves good results: the cost of its solutions stays close to the lower bound (within about 4.35% for the triphoneme coverings). Whereas constraints in SCP are usually binary, we propose here a generalization in which the constraint on each covering feature can be multi-valued.
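As an illustrative aside, the greedy covering heuristic mentioned in the abstract can be sketched as follows. This is a minimal sketch under assumptions of our own: sentences are represented as sets of phonological units (e.g. triphonemes), costs default to one per sentence, and the function name greedy_cover and the toy data are hypothetical. It does not reproduce the authors' agglomerative/splitting strategy or the multi-valued constraint generalization.

    def greedy_cover(sentences, target_units, costs=None):
        """Greedy set covering sketch: sentences maps a sentence id to the set of
        phonological units (e.g. triphonemes) it contains; target_units is the set
        of units the reduced corpus must cover; costs optionally maps a sentence id
        to its cost (defaults to a unit cost per sentence).
        Returns a list of selected sentence ids forming a cover."""
        costs = costs or {sid: 1.0 for sid in sentences}
        uncovered = set(target_units)
        selected = []
        while uncovered:
            # Pick the sentence covering the most still-uncovered units per unit of cost.
            best_sid, best_ratio = None, 0.0
            for sid, units in sentences.items():
                gain = len(units & uncovered)
                if gain:
                    ratio = gain / costs[sid]
                    if ratio > best_ratio:
                        best_sid, best_ratio = sid, ratio
            if best_sid is None:
                raise ValueError("some target units appear in no sentence")
            selected.append(best_sid)
            uncovered -= sentences[best_sid]
        return selected

    # Toy usage: three candidate sentences described by hypothetical triphoneme-like units.
    corpus = {
        "s1": {"a-b-c", "b-c-d"},
        "s2": {"b-c-d", "c-d-e", "d-e-f"},
        "s3": {"a-b-c", "d-e-f"},
    }
    print(greedy_cover(corpus, {"a-b-c", "b-c-d", "c-d-e", "d-e-f"}))  # prints ['s2', 's1']

Following the abstract, a lower bound obtained by Lagrangian relaxation could then be compared with the total cost of such a selection to gauge how far the greedy solution is from optimal.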
Document type: Journal articles

https://hal.inria.fr/hal-01135089


Citation

Nelly Barbot, Olivier Boëffard, Jonathan Chevelu, Arnaud Delhay. Large Linguistic Corpus Reduction with SCP Algorithms. Computational Linguistics, MIT Press, 2015, 41 (3), 30 p. ⟨http://www.mitpressjournals.org/doi/pdf/10.1162/COLI_a_00225⟩. ⟨10.1162/COLI_a_00225⟩. ⟨hal-01135089⟩
