Audio-Visual Analysis In the Framework of Humans Interacting with Robots

Israel Dejene Gebru 1 
1 PERCEPTION - Interpretation and Modelling of Images and Videos
Inria Grenoble - Rhône-Alpes, Grenoble INP - Institut polytechnique de Grenoble - Grenoble Institute of Technology, LJK - Laboratoire Jean Kuntzmann
Abstract: In recent years, there has been growing interest in human-robot interaction (HRI), with the aim of enabling robots to interact and communicate naturally with humans. Natural interaction implies that robots not only need to understand speech and non-verbal communication cues such as body gestures, gaze, or facial expressions, but also the dynamics of the social interplay: finding people in the environment, distinguishing between different people, tracking them through physical space, parsing their actions and activities, estimating their engagement, identifying who is speaking and who speaks to whom, etc. All these tasks require robots to have multimodal perception skills so they can meaningfully detect and integrate information from their multiple sensory channels. In this thesis, we focus on the robot's audio-visual sensory inputs, namely microphones and video cameras. Among the many perception tasks that could be addressed, we explore three: (1) multiple-speaker localization, (2) multiple-person location tracking, and (3) speaker diarization. The majority of existing work in signal processing and computer vision addresses these problems using either audio signals or visual information alone. In this thesis, however, we address them by fusing the audio and visual information gathered by two microphones and one video camera. Our goal is to exploit the complementary nature of the two modalities in the hope of attaining significant improvements in robustness and performance over systems that use a single modality. Moreover, the three problems are addressed in challenging HRI scenarios, such as a robot engaged in multi-party interaction with a varying number of participants, who may speak at the same time, move around the scene, and turn their heads/faces towards the other participants rather than facing the robot.
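
The fusion idea described in the abstract can be illustrated with a minimal sketch. The Python snippet below is an illustration only, not the thesis's actual algorithm: all function names, parameter values, and the Gaussian noise models are hypothetical assumptions. It fuses a two-microphone time-difference-of-arrival (TDOA) cue with camera-detected face directions by multiplying per-direction likelihoods, so the audio cue disambiguates which detected face is the active speaker.

import numpy as np

# Hedged illustration of audio-visual fusion for speaker localization.
# All names and values are hypothetical, not the thesis's actual method.

def audio_likelihood(azimuths_deg, tdoa, mic_distance=0.2, c=343.0, sigma=1e-4):
    # Far-field model: the TDOA between two microphones separated by
    # mic_distance is d * sin(azimuth) / c. Score each candidate azimuth
    # by how well it explains the measured TDOA (Gaussian noise model).
    predicted = mic_distance * np.sin(np.radians(azimuths_deg)) / c
    return np.exp(-0.5 * ((tdoa - predicted) / sigma) ** 2)

def visual_likelihood(azimuths_deg, face_azimuths_deg, sigma=5.0):
    # Mixture of Gaussians centred on the directions of detected faces.
    lik = np.zeros_like(azimuths_deg)
    for mu in face_azimuths_deg:
        lik += np.exp(-0.5 * ((azimuths_deg - mu) / sigma) ** 2)
    return lik

azimuths = np.linspace(-90.0, 90.0, 181)            # candidate directions (degrees)
audio = audio_likelihood(azimuths, tdoa=2.1e-4)     # measured inter-mic delay (s)
visual = visual_likelihood(azimuths, [-30.0, 20.0]) # two faces seen by the camera

fused = audio * visual        # product fusion of the two modalities
fused /= fused.sum()          # normalise into a posterior over directions
print(f"estimated speaker azimuth: {azimuths[np.argmax(fused)]:+.1f} degrees")

This product fusion reflects the complementarity the abstract points to: audio alone gives a coarse direction estimate, while vision alone cannot tell who is speaking; their product peaks at the face consistent with the acoustic cue.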
Cited literature: 152 references
Contributor: Perception team
Submitted on: Monday, April 23, 2018 - 2:54:31 PM
Last modification on: Friday, March 25, 2022 - 9:42:10 AM
Long-term archiving on: Tuesday, September 18, 2018 - 10:36:53 PM


Files produced by the author(s)


  • HAL Id: tel-01774233, version 1



Israel Dejene Gebru. Audio-Visual Analysis In the Framework of Humans Interacting with Robots. Computer Vision and Pattern Recognition [cs.CV]. Université Grenoble Alpes, 2018. English. ⟨NNT : ⟩. ⟨tel-01774233⟩


