Large-scale Learning from Video and Natural Language

Antoine Miech

Thesis Year: 2020

Apprentissage vidéo et langage naturel à grande échelle

Antoine Miech

Role: Author
PersonId: 1041372

Models of visual object recognition and scene understanding

Abstract

The goal of this thesis is to build and train machine learning models capable of understanding the content of videos. Current video understanding approaches mainly rely on large-scale manually annotated video datasets for training. However, collecting and annotating such datasets is cumbersome, expensive and time-consuming. To address this issue, this thesis focuses on leveraging large amounts of readily available but noisy annotations in the form of natural language. In particular, we exploit a diverse corpus of textual metadata such as movie scripts, web video titles and descriptions, or automatically transcribed speech obtained from narrated videos. Training video models on such readily available textual data is challenging because these annotations are often imprecise or wrong. In this thesis, we introduce learning approaches to deal with weak annotation and design specialized training objectives and neural network architectures.

We are interested in the machine learning of algorithms for automatic video understanding. Most video understanding approaches depend on large manually annotated video datasets for training. However, collecting and annotating such datasets is tedious, expensive and time-consuming. To address this problem, this thesis focuses on exploiting large amounts of publicly available, though noisy, annotations in the form of natural language. In particular, we consider a diverse corpus of textual metadata including movie scripts, titles and descriptions of web videos, and speech transcriptions. Using this kind of publicly available data is difficult because the annotation is weak. To this end, we introduce several learning approaches, such as new loss functions and neural network architectures, suited to weak annotations.

Keywords

Computer vision, Computer vision and image understanding, Video analysis, Video and language, Weakly-supervised learning, Deep learning, Machine learning

Artificial vision, Computer vision, Video analysis, Video and language, Weakly-supervised learning, Deep learning, Machine learning

Domains

Computer Vision and Pattern Recognition [cs.CV]

Main file

main.pdf (69.81 MB)

Origin: Files produced by the author(s)

https://inria.hal.science/tel-03084216

Submitted on: Sunday, December 20, 2020, 21:43:25

Last modified on: Friday, April 19, 2024, 16:18:58

Long-term archiving on: Sunday, March 21, 2021, 18:19:17

Dates and versions

tel-03084216, version 1 (20-12-2020)

Identifiers

HAL Id: tel-03084216, version 1

Cite

Antoine Miech. Large-scale Learning from Video and Natural Language. Computer Vision and Pattern Recognition [cs.CV]. PSL Research University, 2020. English. ⟨NNT : ⟩. ⟨tel-03084216⟩

Collections

ENS-PARIS, CNRS, INRIA, THESES-ENS, INRIA2, PSL

212 Views

14 Downloads
