Conference paper, 2021

Automated audio captioning by fine-tuning BART with AudioSet tags

Félix Gontier (1), Romain Serizel (1), Christophe Cerisara (2)

Abstract

Automated audio captioning is the multimodal task of describing environmental audio recordings with fluent natural language. Most current methods use pre-trained analysis models to extract relevant semantic content from the audio input. However, prior information on language modeling is rarely introduced, and the corresponding architectures are limited in capacity due to data scarcity. In this paper, we present a method leveraging the linguistic information contained in BART, a large-scale conditional language model with general-purpose pre-training. Caption generation is conditioned on sequences of textual AudioSet tags. This input is enriched with temporally aligned audio embeddings, which allow the model to improve sound event recognition. The full BART architecture is fine-tuned with only a few additional parameters. Experimental results demonstrate that, beyond the scaling properties of the architecture, language-only pre-training improves text quality in the multimodal setting of audio captioning. The best model achieves state-of-the-art performance on AudioCaps with 46.5 SPIDEr.
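The input construction described above can be sketched as follows. This is a minimal illustration, not the authors' code: the tag serialization, the embedding dimensions, and the linear projection (learned in the real model, random here) are all assumptions for shape demonstration only.

```python
import numpy as np

def build_tag_prompt(tags):
    """Serialize AudioSet tags into a single text prompt for the BART encoder."""
    return ", ".join(tags)

def project_audio_embeddings(audio_emb, hidden_size, rng):
    """Map frame-level audio embeddings (T x D_audio) to the model hidden size.

    In the paper this alignment uses a learned layer; random weights are used
    here purely to illustrate the shapes involved.
    """
    d_audio = audio_emb.shape[1]
    w = rng.standard_normal((d_audio, hidden_size)) * 0.02
    return audio_emb @ w  # (T, hidden_size)

rng = np.random.default_rng(0)
tags = ["Speech", "Dog", "Bark"]            # hypothetical AudioSet tags
prompt = build_tag_prompt(tags)             # "Speech, Dog, Bark"
audio = rng.standard_normal((10, 527))      # 10 frames of tagger embeddings (assumed D=527)
aligned = project_audio_embeddings(audio, hidden_size=768, rng=rng)
print(prompt)
print(aligned.shape)
```

In the full method, the projected frame embeddings enrich the tag-token sequence fed to BART, and the whole architecture is then fine-tuned on caption data.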
Main file: DCASE2021Workshop_Gontier_57.pdf (1.71 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-03522488 , version 1 (12-01-2022)

Identifiers

  • HAL Id : hal-03522488 , version 1

Cite

Félix Gontier, Romain Serizel, Christophe Cerisara. Automated audio captioning by fine-tuning BART with AudioSet tags. DCASE 2021 - 6th Workshop on Detection and Classification of Acoustic Scenes and Events, Nov 2021, Virtual, Spain. ⟨hal-03522488⟩