Conference papers

Episodic Transformer for Vision-and-Language Navigation

Alexander Pashevich 1 Cordelia Schmid 2, 3 Chen Sun 2, 4 
1 Thoth - Apprentissage de modèles à partir de données massives
Inria Grenoble - Rhône-Alpes, LJK - Laboratoire Jean Kuntzmann
3 WILLOW - Models of visual object recognition and scene understanding
DI-ENS - Département d'informatique - ENS Paris, Inria de Paris
Abstract: Interaction and navigation defined by natural language instructions in dynamic environments pose significant challenges for neural agents. This paper focuses on addressing two challenges: handling long sequences of subtasks, and understanding complex human instructions. We propose Episodic Transformer (E.T.), a multimodal transformer that encodes language inputs and the full episode history of visual observations and actions. To improve training, we leverage synthetic instructions as an intermediate representation that decouples understanding the visual appearance of an environment from the variations of natural language instructions. We demonstrate that encoding the history with a transformer is critical for solving compositional tasks, and that pretraining and joint training with synthetic instructions further improve performance. Our approach sets a new state of the art on the challenging ALFRED benchmark, achieving 38.4% and 8.5% task success rates on the seen and unseen test splits, respectively.
Contributor : Alexander Pashevich
Submitted on : Friday, October 8, 2021 - 10:53:26 PM
Last modification on : Friday, September 9, 2022 - 11:38:05 AM



  • HAL Id : hal-03371803, version 1
  • ARXIV : 2105.06453



Alexander Pashevich, Cordelia Schmid, Chen Sun. Episodic Transformer for Vision-and-Language Navigation. ICCV 2021 - International Conference on Computer Vision, Oct 2021, Virtual, United States. pp.1-18. ⟨hal-03371803⟩


