CoVR: Learning Composed Video Retrieval from Web Video Captions

Lucas Ventura; Antoine Yang; Cordelia Schmid; Gül Varol

Preprint, Working Paper. Year: 2023

Lucas Ventura

Role: Author
PersonId: 1320557
IdHAL: lucas-ventura
ORCID: 0000-0001-5795-0064

Laboratoire d'Informatique Gaspard-Monge

École des Ponts ParisTech

IMAGINE [Marne-la-Vallée]

Models of visual object recognition and scene understanding

Antoine Yang

Role: Author

Département d'informatique - ENS Paris

Models of visual object recognition and scene understanding

Cordelia Schmid

Role: Author

Département d'informatique - ENS Paris

Models of visual object recognition and scene understanding

Gül Varol

Role: Author

Laboratoire d'Informatique Gaspard-Monge

Abstract

Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers both text and image queries together, to search for relevant images in a database. Most CoIR approaches require manually annotated datasets, comprising image-text-image triplets, where the text describes a modification from the query image to the target image. However, manual curation of CoIR triplets is expensive and prevents scalability. In this work, we instead propose a scalable automatic dataset creation methodology that generates triplets given video-caption pairs, while also expanding the scope of the task to include composed video retrieval (CoVR). To this end, we mine paired videos with a similar caption from a large database, and leverage a large language model to generate the corresponding modification text. Applying this methodology to the extensive WebVid2M collection, we automatically construct our WebVid-CoVR dataset, resulting in 1.6 million triplets. Moreover, we introduce a new benchmark for CoVR with a manually annotated evaluation set, along with baseline results. Our experiments further demonstrate that training a CoVR model on our dataset effectively transfers to CoIR, leading to improved state-of-the-art performance in the zero-shot setup on both the CIRR and FashionIQ benchmarks. Our code, datasets, and models are publicly available at https://imagine.enpc.fr/~ventural/covr.
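
The triplet-mining idea described in the abstract can be illustrated with a minimal sketch. It rests on assumptions the abstract does not state: "similar caption" is read here as "captions that differ in exactly one word", and a fixed text template stands in for the large language model that produces the modification text. The function names `mine_caption_pairs` and `make_triplets` and the video ids are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def mine_caption_pairs(captions):
    """Find caption pairs that differ in exactly one word.

    `captions` maps a (hypothetical) video id to its caption string.
    Returns a list of (id_a, id_b, word_a, word_b) tuples.
    """
    buckets = defaultdict(list)
    for video_id, caption in captions.items():
        words = caption.lower().split()
        for i, word in enumerate(words):
            # Key: the caption with the word at position i masked out.
            # Two captions share a key only if they agree everywhere else.
            key = (i, tuple(words[:i]), tuple(words[i + 1:]))
            buckets[key].append((video_id, word))

    pairs = []
    for entries in buckets.values():
        for (id_a, word_a), (id_b, word_b) in combinations(entries, 2):
            if word_a != word_b:  # skip identical captions
                pairs.append((id_a, id_b, word_a, word_b))
    return pairs


def make_triplets(captions):
    """Build (query video, modification text, target video) triplets.

    The template below is only a placeholder for the language-model
    step mentioned in the abstract (an assumption of this sketch).
    """
    return [
        {"query": a, "modification": f"replace {wa} with {wb}", "target": b}
        for a, b, wa, wb in mine_caption_pairs(captions)
    ]


if __name__ == "__main__":
    demo = {
        "v1": "a dog running on the beach",
        "v2": "a cat running on the beach",
        "v3": "a man cooking dinner",
    }
    for triplet in make_triplets(demo):
        print(triplet)
```

Masking one position at a time and bucketing on the remainder avoids comparing every caption with every other one, which matters when the caption pool is in the millions.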

Keywords

Computer vision

Domains

Artificial Intelligence [cs.AI]

Main file

2308.14746.pdf (10.96 MB)

Origin: Files produced by the author(s)

Contributor: Lucas Ventura

https://hal.science/hal-04327307

Submitted on: Wednesday, December 6, 2023, 15:48:43

Last modified on: Saturday, April 20, 2024, 03:37:00

Dates and versions

hal-04327307, version 1 (06-12-2023)

Identifiers

HAL Id: hal-04327307, version 1
arXiv: 2308.14746

Cite

Lucas Ventura, Antoine Yang, Cordelia Schmid, Gül Varol. CoVR: Learning Composed Video Retrieval from Web Video Captions. 2023. ⟨hal-04327307⟩

Collections

ENS-PARIS ENPC CNRS INRIA LIGM_A3SI PARISTECH LIGM INRIA2 GENCI PSL ANR UNIV-EIFFEL JSE2024

185 views

12 downloads
