Conference paper, 2021

Automated audio captioning by fine-tuning BART with AudioSet tags

Félix Gontier (1), Romain Serizel (1), Christophe Cerisara (2)

Abstract

Automated audio captioning is the multimodal task of describing environmental audio recordings with fluent natural language. Most current methods use pre-trained analysis models to extract relevant semantic content from the audio input. However, prior information on language modeling is rarely introduced, and the corresponding architectures are limited in capacity due to data scarcity. In this paper, we present a method leveraging the linguistic information contained in BART, a large-scale conditional language model with general-purpose pre-training. Caption generation is conditioned on sequences of textual AudioSet tags. This input is enriched with temporally aligned audio embeddings, which allow the model to improve sound event recognition. The full BART architecture is fine-tuned with only a few additional parameters. Experimental results demonstrate that, beyond the scaling properties of the architecture, language-only pre-training improves text quality in the multimodal setting of audio captioning. The best model achieves state-of-the-art performance on AudioCaps with 46.5 SPIDEr.
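The input construction described above can be sketched as follows. This is a minimal illustration, not the authors' code: the tag serialization, the embedding dimensions, and the linear projection (learned in the real model, random here) are all assumptions for shape demonstration only.

```python
import numpy as np

def build_tag_prompt(tags):
    """Serialize AudioSet tags into a single text prompt for the BART encoder."""
    return ", ".join(tags)

def project_audio_embeddings(audio_emb, hidden_size, rng):
    """Map frame-level audio embeddings (T x D_audio) to the model hidden size.

    In the paper this alignment uses a learned layer; random weights are used
    here purely to illustrate the shapes involved.
    """
    d_audio = audio_emb.shape[1]
    w = rng.standard_normal((d_audio, hidden_size)) * 0.02
    return audio_emb @ w  # (T, hidden_size)

rng = np.random.default_rng(0)
tags = ["Speech", "Dog", "Bark"]            # hypothetical AudioSet tags
prompt = build_tag_prompt(tags)             # "Speech, Dog, Bark"
audio = rng.standard_normal((10, 527))      # 10 frames of tagger embeddings (assumed D=527)
aligned = project_audio_embeddings(audio, hidden_size=768, rng=rng)
print(prompt)
print(aligned.shape)
```

In the full method, the projected frame embeddings enrich the tag-token sequence fed to BART, and the whole architecture is then fine-tuned on caption data.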
Main file: DCASE2021Workshop_Gontier_57.pdf (1.71 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-03522488 , version 1 (12-01-2022)

Identifiers

  • HAL Id : hal-03522488 , version 1

Cite

Félix Gontier, Romain Serizel, Christophe Cerisara. Automated audio captioning by fine-tuning BART with AudioSet tags. DCASE 2021 - 6th Workshop on Detection and Classification of Acoustic Scenes and Events, Nov 2021, Virtual, Spain. ⟨hal-03522488⟩