Airbert: In-domain Pretraining for Vision-and-Language Navigation

Pierre-Louis Guhur; Makarand Tapaswi; Shizhe Chen; Ivan Laptev; Cordelia Schmid

Conference paper. Year: 2021

Pierre-Louis Guhur

Role: Author
PersonId : 1119663

Models of visual object recognition and scene understanding

Makarand Tapaswi

Role: Author

International Institute of Information Technology, Hyderabad [Hyderabad]

Shizhe Chen

Role: Author

Models of visual object recognition and scene understanding

Ivan Laptev

Role: Author

Models of visual object recognition and scene understanding

Cordelia Schmid

Role: Author

Models of visual object recognition and scene understanding

Abstract

Vision-and-language navigation (VLN) aims to enable embodied agents to navigate in realistic environments using natural language instructions. Given the scarcity of domain-specific training data and the high diversity of image and language inputs, the generalization of VLN agents to unseen environments remains challenging. Recent methods explore pretraining to improve generalization; however, the use of generic image-caption datasets or existing small-scale VLN environments is suboptimal and results in limited improvements. In this work, we introduce BnB, a large-scale and diverse in-domain VLN dataset. We first collect image-caption (IC) pairs from hundreds of thousands of listings from online rental marketplaces. Using IC pairs, we next propose automatic strategies to generate millions of VLN path-instruction (PI) pairs. We further propose a shuffling loss that improves the learning of temporal order inside PI pairs. We use BnB to pretrain our Airbert model, which can be adapted to discriminative and generative settings, and show that it outperforms the state of the art on the Room-to-Room (R2R) navigation and Remote Referring Expression (REVERIE) benchmarks. Moreover, our in-domain pretraining significantly increases performance on a challenging few-shot VLN evaluation, where we train the model only on VLN instructions from a few houses.
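The abstract does not spell out how the shuffling loss is computed. As an illustration only, one plausible reading is that a correctly ordered path is contrasted with order-shuffled copies of itself, so that a model must attend to temporal order to tell them apart. The sketch below builds such shuffled negatives and applies a plain cross-entropy over candidate scores. The function names `shuffled_negative` and `ordering_loss`, the file names, and the choice of cross-entropy are assumptions made for this example, not the authors' implementation.

```python
import math
import random


def shuffled_negative(path, rng):
    """Return a copy of `path` whose step order differs from the original.

    Hypothetical helper: a path is treated as a list of image identifiers.
    """
    if len(set(path)) < 2:
        raise ValueError("need at least two distinct steps to shuffle")
    negative = list(path)
    # Reshuffle until the order actually changes.
    while negative == list(path):
        rng.shuffle(negative)
    return negative


def ordering_loss(scores, positive_index=0):
    """Cross-entropy over candidate scores, with the ordered path as target.

    `scores` stands in for the compatibility scores a model would assign
    to (path, instruction) candidates; no model is defined here.
    """
    peak = max(scores)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return log_norm - scores[positive_index]


rng = random.Random(0)
# Made-up image identifiers standing in for one path.
path = ["hallway.jpg", "kitchen.jpg", "bedroom.jpg", "bathroom.jpg"]
negatives = [shuffled_negative(path, rng) for _ in range(3)]
```

In this sketch the loss is small when the ordered path receives a higher score than its shuffled copies and large otherwise, which is the behaviour one would expect from an objective meant to teach temporal order.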

Domains

Artificial Intelligence [cs.AI]

Main file

2108.09105.pdf (43.12 MB)

Origin: Files produced by the author(s)

Contributor: Pierre-Louis Guhur

https://hal.science/hal-03470013

Submitted on: Wednesday, December 8, 2021, 09:08:17

Last modified on: Tuesday, January 16, 2024, 16:28:53

Long-term archiving on: Wednesday, March 9, 2022, 18:10:21

Dates and versions

hal-03470013, version 1 (08-12-2021)

Identifiers

HAL Id: hal-03470013, version 1

Cite

Pierre-Louis Guhur, Makarand Tapaswi, Shizhe Chen, Ivan Laptev, Cordelia Schmid. Airbert: In-domain Pretraining for Vision-and-Language Navigation. ICCV 2021 - International Conference on Computer Vision, Oct 2021, Virtual, France. ⟨hal-03470013⟩

Collections

ENS-PARIS CNRS INRIA INRIA2 GENCI PSL ANR PRAIRIE-IA
