Self-Supervised Video Representation Learning via Latent Time Navigation

Di Yang; Yaohui Wang; Quan Kong; Antitza Dantcheva; Lorenzo Garattoni; Gianpiero Francesca; Francois F Bremond

doi:10.1609/aaai.v37i3.25416

Conference paper, Year: 2023


Di Yang

Role: Author
PersonId : 1089515

Spatio-Temporal Activity Recognition Systems

Université Côte d'Azur

Yaohui Wang

Role: Author

Spatio-Temporal Activity Recognition Systems

Université Côte d'Azur

Quan Kong

Role: Author

Woven Planet Holdings

Antitza Dantcheva

Role: Author

Spatio-Temporal Activity Recognition Systems

Université Côte d'Azur

Lorenzo Garattoni

Role: Author

Toyota Motor Europe

Gianpiero Francesca

Role: Author

Toyota Motor Europe

Francois F Bremond

Role: Author
PersonId : 20805
IdHAL : francois-bremond
ORCID : 0000-0003-2988-2142
IdRef : 138919046

Spatio-Temporal Activity Recognition Systems

Université Côte d'Azur

Abstract

Self-supervised video representation learning commonly aims to maximize the similarity between different temporal segments of one video in order to enforce feature persistence over time. This leads to a loss of pertinent information about temporal relationships, rendering actions such as 'enter' and 'leave' indistinguishable. To mitigate this limitation, we propose Latent Time Navigation (LTN), a time-parameterized contrastive learning strategy that is streamlined to capture fine-grained motions. Specifically, we maximize the representation similarity between different segments of one video while keeping their representations time-aware along a subspace of the latent representation code, which includes an orthogonal basis to represent temporal changes. Our extensive experimental analysis suggests that learning video representations with LTN consistently improves action classification performance on fine-grained and human-oriented tasks (e.g., on the Toyota Smarthome dataset). In addition, we demonstrate that our proposed model, when pre-trained on Kinetics-400, generalizes well to the unseen real-world video benchmark datasets UCF101 and HMDB51, achieving state-of-the-art performance in action recognition.
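
The idea in the abstract can be illustrated with a small numerical sketch. This is not the authors' implementation. It only assumes, for illustration, that (a) the temporal subspace is spanned by a few orthonormal directions, (b) a segment's representation is shifted along those directions by coefficients derived from its timestamp, and (c) the shifted representations of two segments of the same video are pulled together with an InfoNCE-style contrastive loss. The function names (`orthonormal_basis`, `time_shift`, `info_nce`), the sinusoidal time coefficients, and the random stand-in features are all hypothetical.

```python
import numpy as np

def orthonormal_basis(dim, num_directions, rng):
    """Return `num_directions` orthonormal row vectors in R^dim (via QR)."""
    q, _ = np.linalg.qr(rng.standard_normal((dim, num_directions)))
    return q.T  # shape: (num_directions, dim)

def time_shift(features, basis, coeffs):
    """Shift each feature inside the temporal subspace: f + sum_i a_i * d_i."""
    return features + coeffs @ basis

def info_nce(queries, keys, temperature=0.1):
    """InfoNCE loss where queries[i] and keys[i] form the positive pair."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ k.T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
batch, dim, num_directions = 8, 32, 4
basis = orthonormal_basis(dim, num_directions, rng)

# Random stand-ins for encoder outputs of two segments of the same videos.
seg_a = rng.standard_normal((batch, dim))
seg_b = seg_a + 0.05 * rng.standard_normal((batch, dim))

# Hypothetical time coefficients: sinusoids of each segment's timestamp.
freqs = np.arange(1, num_directions + 1)
coeffs_a = np.sin(rng.uniform(0, 1, size=(batch, 1)) * freqs)
coeffs_b = np.sin(rng.uniform(0, 1, size=(batch, 1)) * freqs)

z_a = time_shift(seg_a, basis, coeffs_a)
z_b = time_shift(seg_b, basis, coeffs_b)
loss = info_nce(z_a, z_b)
print(f"contrastive loss: {loss:.4f}")
```

Because the directions are orthonormal, projecting the shift `z_a - seg_a` back onto `basis` recovers the time coefficients exactly, which is one simple sense in which a representation can remain time-aware along a dedicated subspace while the rest of the code is trained for similarity.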

Domains

Computer Vision and Pattern Recognition [cs.CV]; Artificial Intelligence [cs.AI]

Main file

25416-Article Text-29479-1-2-20230626_Self-supervised_video_representaiton_learning_AAAI2023.pdf (1.9 MB)

Origin: Files produced by the author(s)


https://hal.science/hal-04236128

Submitted on: Tuesday, October 10, 2023, 18:01:22

Last modified on: Monday, February 26, 2024, 11:22:14

Dates and versions

hal-04236128 , version 1 (10-10-2023)

Identifiers

HAL Id : hal-04236128 , version 1
ARXIV : 2305.06437
DOI : 10.1609/aaai.v37i3.25416

Cite

Di Yang, Yaohui Wang, Quan Kong, Antitza Dantcheva, Lorenzo Garattoni, et al. Self-Supervised Video Representation Learning via Latent Time Navigation. AAAI 2023 - AAAI Conference on Artificial Intelligence, Feb 2023, Washington, D.C., United States. ⟨10.1609/aaai.v37i3.25416⟩. ⟨hal-04236128⟩


Collections

INRIA INRIA2 GENCI UNIV-COTEDAZUR 3IA-COTEDAZUR ANR

13 views

16 downloads
