Audio-Visual Analysis In the Framework of Humans Interacting with Robots

Israel D. Gebru
PERCEPTION - Interpretation and Modelling of Images and Videos
Inria Grenoble - Rhône-Alpes, LJK - Laboratoire Jean Kuntzmann, INPG - Institut National Polytechnique de Grenoble
Abstract: In recent years, there has been growing interest in human-robot interaction (HRI), with the aim of enabling robots to interact and communicate naturally with humans. Natural interaction implies that robots must not only understand speech and non-verbal communication cues such as body gestures, gaze, or facial expressions, but also the dynamics of the social interplay: finding people in the environment, distinguishing between different people, tracking them through physical space, parsing their actions and activities, estimating their engagement, identifying who is speaking and who speaks to whom, etc. All these tasks require robots to have multimodal perception skills, so that they can meaningfully detect and integrate information from their multiple sensory channels. In this thesis, we focus on the robot's audio-visual sensory inputs, consisting of microphones and video cameras. Among the many addressable perception tasks, we explore three: (1) multiple-speaker localization, (2) multiple-person location tracking, and (3) speaker diarization. The majority of existing work in signal processing and computer vision addresses these problems using either audio signals or visual information alone. In this thesis, we instead address them by fusing the audio and visual information gathered by two microphones and one video camera. Our goal is to exploit the complementary nature of the audio and visual modalities, in the hope of attaining significant improvements in robustness and performance over systems that use a single modality. Moreover, the three problems are addressed in challenging HRI scenarios, such as a robot engaged in a multi-party interaction with a varying number of participants, who may speak at the same time, move around the scene, and turn their heads towards the other participants rather than facing the robot.
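To give a concrete (if much simplified) flavor of the audio-visual fusion idea, the following toy sketch is not the thesis method: it estimates the time difference of arrival (TDOA) between two microphones with the standard GCC-PHAT technique, converts it to an azimuth angle, and associates that angle with the nearest detected face. The microphone spacing, sampling rate, and association rule here are illustrative assumptions.

```python
# Toy sketch (NOT the thesis method): two-microphone speaker localization
# via GCC-PHAT, then nearest-face association. Constants are assumptions.
import numpy as np

SOUND_SPEED = 343.0   # speed of sound in air, m/s
MIC_DISTANCE = 0.15   # assumed microphone baseline, m

def gcc_phat(sig, ref, fs):
    """Return the TDOA (seconds) of `sig` relative to `ref` via GCC-PHAT."""
    n = len(sig) + len(ref)
    S = np.fft.rfft(sig, n=n)
    R = np.fft.rfft(ref, n=n)
    cross = S * np.conj(R)
    cross /= np.abs(cross) + 1e-12          # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / float(fs)

def tdoa_to_azimuth(tdoa):
    """Map a TDOA to an azimuth (radians) for a two-microphone array."""
    s = np.clip(tdoa * SOUND_SPEED / MIC_DISTANCE, -1.0, 1.0)
    return float(np.arcsin(s))

def associate_speaker(azimuth, face_azimuths):
    """Pick the face whose azimuth is closest to the audio estimate."""
    return int(np.argmin([abs(azimuth - a) for a in face_azimuths]))
```

The actual thesis goes well beyond this kind of hard nearest-neighbor association, using probabilistic models to fuse the two modalities and to cope with overlapping speakers and moving participants.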

Cited literature: 152 references

https://hal.inria.fr/tel-01774233
Contributor: Team Perception
Submitted on: Monday, April 23, 2018 - 2:54:31 PM


Identifiers

  • HAL Id : tel-01774233, version 1

Citation

Israel D. Gebru. Audio-Visual Analysis In the Framework of Humans Interacting with Robots. Computer Vision and Pattern Recognition [cs.CV]. Université Grenoble Alpes, 2018. English. ⟨tel-01774233⟩

Metrics: 363 record views, 262 file downloads