DP-Parse: Finding Word Boundaries from Raw Speech with an Instance Lexicon

Robin Algayres; Tristan Ricoul; Julien Karadayi; Hugo Laurençon; Salah Zaiem; Abdelrahman Mohamed; Benoît Sagot; Emmanuel Dupoux

doi:10.1162/tacl_a_00505

Journal article: Transactions of the Association for Computational Linguistics. Year: 2022

Robin Algayres

Role: Author
PersonId: 1179577

Apprentissage machine et développement cognitif

Tristan Ricoul

Role: Author

École normale supérieure - Paris

Julien Karadayi

Role: Author

École normale supérieure - Paris

Hugo Laurençon

Role: Author

École normale supérieure - Paris

Salah Zaiem

Role: Author

École normale supérieure - Paris

Abdelrahman Mohamed

Role: Author

Meta AI Research [Paris]

Benoît Sagot

Role: Author
PersonId: 1461
IdHAL: bsagot
ORCID: 0000-0002-0107-8526
IdRef: 177454229

Automatic Language Modelling and ANAlysis & Computational Humanities

Emmanuel Dupoux

Role: Author

École normale supérieure - Paris

École des hautes études en sciences sociales

Meta AI Research [Paris]

Apprentissage machine et développement cognitif

Abstract

Finding word boundaries in continuous speech is challenging because there is little or no equivalent of a 'space' delimiter between words. Popular Bayesian non-parametric models for text segmentation (Goldwater et al., 2006, 2009) use a Dirichlet process to jointly segment sentences and build a lexicon of word types. We introduce DP-Parse, which uses similar principles but relies only on an instance lexicon of word tokens, avoiding the clustering errors that arise with a lexicon of word types. On the Zero Resource Speech Benchmark 2017, our model sets a new speech segmentation state of the art in 5 languages. The algorithm monotonically improves with better input representations, achieving yet higher scores when fed with weakly supervised inputs. Despite lacking a type lexicon, DP-Parse can be pipelined to a language model and learn semantic and syntactic representations, as assessed by a new spoken word embedding benchmark.

Domains

Linguistics; Computer Science; Machine Learning [stat.ML]

Main file

2206.11332.pdf (575.77 KB)

Origin: Files produced by the author(s)

Contributor: Sabrina Zermani

https://inria.hal.science/hal-03831873

Submitted on: Thursday, October 27, 2022, 11:36:59

Last modified on: Friday, April 19, 2024, 16:18:57

Long-term archiving on: Saturday, January 28, 2023, 18:33:50

Dates and versions

hal-03831873, version 1 (27-10-2022)

Identifiers

HAL Id: hal-03831873, version 1
arXiv: 2206.11332
DOI: 10.1162/tacl_a_00505

Cite

Robin Algayres, Tristan Ricoul, Julien Karadayi, Hugo Laurençon, Salah Zaiem, et al. DP-Parse: Finding Word Boundaries from Raw Speech with an Instance Lexicon. Transactions of the Association for Computational Linguistics, 2022, 10, pp.1051-1065. ⟨10.1162/tacl_a_00505⟩. ⟨hal-03831873⟩

Collections

ENS-PARIS CNRS INRIA EHESS LSCP DEC INRIA2 PSL

53 views

41 downloads
