VidChapters-7M: Video Chapters at Scale

Antoine Yang; Arsha Nagrani; Ivan Laptev; Josef Sivic; Cordelia Schmid

Conference paper, Year: 2023

Antoine Yang

Role: Author
PersonId: 748008
IdHAL: antoine-yang
ORCID: 0000-0002-7258-571X

Département d'informatique - ENS Paris

Models of visual object recognition and scene understanding

Arsha Nagrani

Role: Author

Visual Geometry Group

Ivan Laptev

Role: Author
PersonId: 865349

Département d'informatique - ENS Paris

Models of visual object recognition and scene understanding

Josef Sivic

Role: Author
PersonId: 945630

Czech Institute of Informatics, Robotics and Cybernetics [Prague]

Cordelia Schmid

Role: Author
PersonId: 831154

Département d'informatique - ENS Paris

Models of visual object recognition and scene understanding

Abstract

Segmenting long videos into chapters enables users to quickly navigate to the information of their interest. This important topic has been understudied due to the lack of publicly released datasets. To address this issue, we present VidChapters-7M, a dataset of 817K user-chaptered videos including 7M chapters in total. VidChapters-7M is automatically created from videos online in a scalable manner by scraping user-annotated chapters and hence without any additional manual annotation. We introduce the following three tasks based on this data. First, the video chapter generation task consists of temporally segmenting the video and generating a chapter title for each segment. To further dissect the problem, we also define two variants of this task: video chapter generation given ground-truth boundaries, which requires generating a chapter title given an annotated video segment, and video chapter grounding, which requires temporally localizing a chapter given its annotated title. We benchmark both simple baselines and state-of-the-art video-language models for these three tasks. We also show that pretraining on VidChapters-7M transfers well to dense video captioning tasks in both zero-shot and finetuning settings, largely improving the state of the art on the YouCook2 and ViTT benchmarks. Finally, our experiments reveal that downstream performance scales well with the size of the pretraining dataset.
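The abstract states that the dataset is built from user-annotated chapters and that three tasks are defined on it. The sketch below illustrates one plausible shape for that data. It assumes chapters appear in a video description as lines beginning with a timestamp (for example `1:30 - Setup`), which is not confirmed by this record. The names `Chapter`, `parse_chapters`, and `task_examples` are hypothetical, and the code is not the authors' collection pipeline.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Assumed line format: "<timestamp> [-|:] <title>", e.g. "0:00 Intro" or
# "01:02:03 - Results". Real descriptions vary; this pattern is illustrative.
_TIMESTAMP_LINE = re.compile(r"^\s*((?:\d{1,2}:)?\d{1,2}:\d{2})\s*[-:]?\s*(.+?)\s*$")


@dataclass
class Chapter:
    start: float  # chapter start, in seconds
    end: float    # chapter end, in seconds
    title: str


def to_seconds(stamp: str) -> int:
    """Convert "M:SS" or "H:MM:SS" to a number of seconds."""
    seconds = 0
    for part in stamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def parse_chapters(description: str, duration: float) -> list[Chapter]:
    """Extract chapters from timestamped lines; each chapter ends where the next begins."""
    starts = []
    for line in description.splitlines():
        match = _TIMESTAMP_LINE.match(line)
        if match:
            starts.append((to_seconds(match.group(1)), match.group(2)))
    starts.sort(key=lambda item: item[0])

    chapters = []
    for i, (start, title) in enumerate(starts):
        # The last chapter runs to the end of the video.
        end = starts[i + 1][0] if i + 1 < len(starts) else duration
        chapters.append(Chapter(start, end, title))
    return chapters


def task_examples(chapters: list[Chapter]) -> dict:
    """Pair the input and target of each task described in the abstract.

    Treating the grounding target as a (start, end) segment is an assumption.
    """
    return {
        # Full task: from the video alone, predict every segment and its title.
        "chapter_generation": [(c.start, c.end, c.title) for c in chapters],
        # Variant 1: given an annotated segment, predict its title.
        "generation_given_boundaries": [((c.start, c.end), c.title) for c in chapters],
        # Variant 2: given an annotated title, localize the chapter in time.
        "chapter_grounding": [(c.title, (c.start, c.end)) for c in chapters],
    }
```

For instance, `parse_chapters("0:00 Intro\n1:30 - Setup", duration=300)` would yield two chapters covering 0 to 90 seconds and 90 to 300 seconds. Deriving each end time from the next start time reflects the idea that chapters partition the video, which is why the last chapter needs the video duration.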

Keywords

Video Understanding; Video Chapter Generation; Video Chapter Grounding; Dense Video Captioning; Pretraining; Language Model; Zero-Shot Learning; Computer Vision; Vision and Language; Data

Domains

Computer Science [cs]; Artificial Intelligence [cs.AI]; Computation and Language [cs.CL]; Computer Vision and Pattern Recognition [cs.CV]; Machine Learning [cs.LG]

Main file

view.pdf (8.49 MB)

Origin: Files produced by the author(s)

https://hal.science/hal-04217697

Submitted on: Tuesday, September 26, 2023, 08:23:24

Last modified on: Friday, April 19, 2024, 16:18:56

Long-term archiving on: Wednesday, December 27, 2023, 18:25:05

Dates and versions

hal-04217697, version 1 (26-09-2023)

Identifiers

HAL Id: hal-04217697, version 1
arXiv: 2309.13952

Cite

Antoine Yang, Arsha Nagrani, Ivan Laptev, Josef Sivic, Cordelia Schmid. VidChapters-7M: Video Chapters at Scale. NeurIPS 2023 - Conference on Neural Information Processing Systems - Track on Datasets and Benchmarks, Dec 2023, New Orleans (LA), United States. ⟨hal-04217697⟩

Collections

ENS-PARIS CNRS INRIA INRIA2 GENCI PSL ANR PRAIRIE-IA
