Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation

Shizhe Chen; Pierre-Louis Guhur; Makarand Tapaswi; Cordelia Schmid; Ivan Laptev

Conference paper, Year: 2022



Shizhe Chen

Role: Author
PersonId : 1119250

Models of visual object recognition and scene understanding

Pierre-Louis Guhur

Role: Author

Models of visual object recognition and scene understanding

Makarand Tapaswi

Role: Author
PersonId : 1062676

International Institute of Information Technology, Hyderabad [Hyderabad]

Cordelia Schmid

Role: Author
PersonId : 831154

Models of visual object recognition and scene understanding

Ivan Laptev

Role: Author
PersonId : 865349

Models of visual object recognition and scene understanding

Abstract

Following language instructions to navigate in unseen environments is a challenging problem for autonomous embodied agents. The agent must not only ground language in visual scenes but also explore the environment to reach its target. In this work, we propose a dual-scale graph transformer (DUET) for joint long-term action planning and fine-grained cross-modal understanding. We build a topological map on the fly to enable efficient exploration in a global action space. To balance the complexity of reasoning over a large action space against fine-grained language grounding, we dynamically combine a fine-scale encoding of local observations and a coarse-scale encoding of a global map via graph transformers. DUET significantly outperforms state-of-the-art methods on the goal-oriented vision-and-language navigation (VLN) benchmarks REVERIE and SOON. It also improves the success rate on the fine-grained VLN benchmark R2R.
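The two ideas in the abstract, a topological map grown during navigation and a blend of coarse and fine scores, can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the authors' implementation: the `TopoMap` class, the `fuse_scores` helper, the fixed `gate` weight (a stand-in for a learned, input-dependent weight), and the hand-written scores (stand-ins for transformer outputs) are all hypothetical.

```python
from collections import deque


class TopoMap:
    """Toy topological map grown as the agent moves (illustrative only)."""

    def __init__(self):
        self.edges = {}       # node id -> set of neighbouring node ids
        self.visited = set()  # nodes the agent has stood on

    def update(self, current, navigable):
        """Record a visit to `current` and the neighbours observed from it."""
        self.visited.add(current)
        self.edges.setdefault(current, set())
        for n in navigable:
            self.edges[current].add(n)
            self.edges.setdefault(n, set()).add(current)

    def global_actions(self):
        """Every observed-but-unvisited node, sorted for determinism."""
        return sorted(n for n in self.edges if n not in self.visited)

    def shortest_path(self, start, goal):
        """Breadth-first search over the map; returns a node list or None."""
        parent = {start: None}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if node == goal:
                path = []
                while node is not None:
                    path.append(node)
                    node = parent[node]
                return path[::-1]
            for nxt in sorted(self.edges.get(node, ())):
                if nxt not in parent:
                    parent[nxt] = node
                    queue.append(nxt)
        return None


def fuse_scores(global_scores, local_scores, gate):
    """Blend coarse (map-level) and fine (local) scores per candidate.

    Assumption: candidates not visible locally keep their global score.
    """
    fused = {}
    for node, g in global_scores.items():
        if node in local_scores:
            fused[node] = gate * g + (1.0 - gate) * local_scores[node]
        else:
            fused[node] = g
    return fused


# The agent starts at A, sees B and C, moves to B, and sees D.
topo = TopoMap()
topo.update("A", ["B", "C"])
topo.update("B", ["D"])
candidates = topo.global_actions()          # unvisited nodes on the map

global_scores = {"C": 0.9, "D": 0.3}        # hypothetical coarse-scale scores
local_scores = {"D": 0.4}                   # only D is visible from B
fused = fuse_scores(global_scores, local_scores, gate=0.5)

target = max(fused, key=fused.get)
path = topo.shortest_path("B", target)      # backtracks through A to reach C
print(candidates, target, path)
```

In this toy run the best candidate is not adjacent to the agent, so acting in the global action space means backtracking along the map. A purely local action space could not express that move in a single decision.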

Keywords

Vision-and-Language Navigation; Transformer

Domains

Artificial Intelligence [cs.AI]; Computer Vision and Pattern Recognition [cs.CV]; Machine Learning [cs.LG]

Main file

03269.pdf (1.59 MB)

03269-supp.pdf (2.81 MB)

Origin: Files produced by the author(s)


https://inria.hal.science/hal-03696868

Submitted on: Thursday, June 16, 2022, 11:48:34

Last modified on: Friday, April 19, 2024, 16:18:58

Dates and versions

hal-03696868, version 1 (16-06-2022)

Identifiers

HAL Id: hal-03696868, version 1

Cite

Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev. Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation. CVPR 2022 - IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun 2022, New Orleans, United States. ⟨hal-03696868⟩


Collections

ENS-PARIS CNRS INRIA INRIA2 GENCI PSL ANR PRAIRIE-IA

35 views

61 downloads
