Large Linguistic Corpus Reduction with SCP algorithms

Nelly Barbot; Olivier Boëffard; Jonathan Chevelu; Arnaud Delhay

doi:10.1162/COLI_a_00225

Journal article, Computational Linguistics, 2015

Nelly Barbot

Role: Author
PersonId: 740445
IdHAL: nelly-barbot
ORCID: 0000-0003-4177-9324

Expressiveness in Human Centered Data/Media

Olivier Boëffard

Role: Author
PersonId: 883118

Institut de Recherche en Informatique et Systèmes Aléatoires

Jonathan Chevelu

Role: Author
PersonId: 4560
IdHAL: jonathan-chevelu
IdRef: 156873885

Expressiveness in Human Centered Data/Media

Arnaud Delhay

Role: Author
PersonId: 5448
IdHAL: arnaud-delhay
ORCID: 0000-0001-6795-7999
IdRef: 122406354

Expressiveness in Human Centered Data/Media

Abstract

For practical reasons (mainly related to cost), and to meet the quality expectations of the target application, corpus design is a crucial issue when building rich annotated text corpora. Reducing a large corpus while maintaining sufficient linguistic richness can be formalized as a Set Covering Problem (SCP). In this context, we present two heuristic algorithms applied to the design of large text corpora in English and French that cover multi-represented phonological units. The first algorithm is a standard greedy solution with an agglomerative/splitting strategy. The second, which we propose, is based on Lagrangian relaxation. This approach provides a lower bound on the cost of each covering solution, and this bound can be used as a metric to evaluate the quality of a reduced corpus whatever algorithm produced it. Experiments show that a suboptimal algorithm such as the greedy one achieves good results: the cost of its solutions is close to the lower bound (about 4.35% for the triphoneme coverings). Constraints in an SCP are usually binary; we propose here a generalization in which the constraint on each covering feature can be multi-valued.
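To make the formulation in the abstract concrete, here is a minimal Python sketch, not the authors' implementation. It assumes each sentence is a dictionary mapping a unit to its number of occurrences, each sentence has a positive cost (for example its length), and each unit has a required count (the multi-valued constraint). The greedy part only adds sentences; the splitting step mentioned in the abstract is omitted. The lower bound uses a plain subgradient update on the Lagrangian multipliers, which is a common textbook choice and only an assumption here. All function and variable names are hypothetical.

```python
from typing import Dict, List

Sentence = Dict[str, int]  # unit -> number of occurrences in the sentence


def greedy_multicover(sentences: List[Sentence], costs: List[float],
                      required: Dict[str, int]) -> List[int]:
    """Add, one at a time, the sentence with the best useful-coverage/cost ratio."""
    remaining = dict(required)          # occurrences still missing per unit
    available = set(range(len(sentences)))
    chosen: List[int] = []
    while any(r > 0 for r in remaining.values()):
        best, best_score = None, 0.0
        for i in sorted(available):
            # Only occurrences that reduce a remaining requirement count as gain.
            gain = sum(min(n, remaining.get(u, 0)) for u, n in sentences[i].items())
            score = gain / costs[i]
            if score > best_score:
                best, best_score = i, score
        if best is None:
            raise ValueError("required coverage cannot be reached with this corpus")
        chosen.append(best)
        available.remove(best)
        for u, n in sentences[best].items():
            if u in remaining:
                remaining[u] = max(0, remaining[u] - n)
    return chosen


def reduced_cost(sentence: Sentence, cost: float, u: Dict[str, float]) -> float:
    """Cost of a sentence minus the multiplier-weighted coverage it provides."""
    return cost - sum(u.get(j, 0.0) * n for j, n in sentence.items())


def lagrangian_bound(sentences: List[Sentence], costs: List[float],
                     required: Dict[str, int], u: Dict[str, float]) -> float:
    """Value of the relaxed problem for multipliers u >= 0 (a valid lower bound)."""
    bound = sum(u[j] * b for j, b in required.items())
    for s, c in zip(sentences, costs):
        bound += min(0.0, reduced_cost(s, c, u))
    return bound


def subgradient_lower_bound(sentences: List[Sentence], costs: List[float],
                            required: Dict[str, int],
                            iters: int = 200, step: float = 1.0) -> float:
    """Best Lagrangian bound found over a fixed number of subgradient steps."""
    u = {j: 0.0 for j in required}
    best = lagrangian_bound(sentences, costs, required, u)
    for k in range(1, iters + 1):
        best = max(best, lagrangian_bound(sentences, costs, required, u))
        # Sentences selected by the relaxed problem: negative reduced cost.
        picked = [s for s, c in zip(sentences, costs) if reduced_cost(s, c, u) < 0]
        for j, b in required.items():
            covered = sum(s.get(j, 0) for s in picked)
            # Diminishing step size; multipliers are projected back onto u >= 0.
            u[j] = max(0.0, u[j] + (step / k) * (b - covered))
    return best


# Toy corpus (made-up data, for illustration only).
toy_sentences = [{"a": 1, "b": 1}, {"a": 1}, {"b": 1, "c": 1}, {"c": 2}]
toy_costs = [2.0, 1.0, 2.0, 1.5]
toy_required = {"a": 2, "b": 1, "c": 2}

toy_chosen = greedy_multicover(toy_sentences, toy_costs, toy_required)
toy_cost = sum(toy_costs[i] for i in toy_chosen)
toy_bound = subgradient_lower_bound(toy_sentences, toy_costs, toy_required)
print(toy_chosen, toy_cost, toy_bound)
```

The relative gap `(toy_cost - toy_bound) / toy_bound` plays the role of the quality metric described in the abstract: any feasible reduced corpus costs at least the bound, so a small gap indicates a solution close to optimal, whichever algorithm produced it.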

Domains

Computation and Language [cs.CL]

https://inria.hal.science/hal-01135089

Submitted on: Tuesday, March 24, 2015, 16:37:40

Last modified on: Tuesday, October 3, 2023, 09:49:05

Dates and versions

hal-01135089, version 1 (24-03-2015)

Identifiers

HAL Id: hal-01135089, version 1
DOI: 10.1162/COLI_a_00225

Cite

Nelly Barbot, Olivier Boëffard, Jonathan Chevelu, Arnaud Delhay. Large Linguistic Corpus Reduction with SCP algorithms. Computational Linguistics, 2015, 41 (3), pp.30. ⟨10.1162/COLI_a_00225⟩. ⟨hal-01135089⟩

Collections

INSTITUT-TELECOM UNIV-RENNES1 CNRS INRIA INSA-RENNES ENSSAT IRISA CENTRALESUPELEC IRISA-D6 UR1-MATH-STIC UR1-UFR-ISTIC UNIV-RENNES INSA-GROUPE UR1-MATH-NUM
