X. Alameda-Pineda, V. Khalidov, R. Horaud, and F. Forbes, Finding audio-visual events in informal social gatherings, Proceedings of the 13th International Conference on Multimodal Interfaces, ICMI '11, 2011.
DOI : 10.1145/2070481.2070527
URL : https://hal.archives-ouvertes.fr/inria-00623489

E. Arnaud, H. Christensen, Y.-C. Lu, J. Barker, V. Khalidov et al., The CAVA corpus, Proceedings of the 10th International Conference on Multimodal Interfaces, ICMI '08, 2008.
DOI : 10.1145/1452392.1452414
URL : https://hal.archives-ouvertes.fr/inria-00373173

E. Bailly-Baillière, S. Bengio, F. Bimbot, M. Hamouz, J. Kittler et al., The BANCA Database and Evaluation Protocol, Proceedings of the International Conference on Audio- and Video-Based Biometric Person Authentication, pp.625-638, 2003.
DOI : 10.1007/3-540-44887-X_74

J.-Y. Bouguet, Camera calibration toolbox for Matlab, 2008.

M. Brookes, Voicebox: Speech processing toolbox for Matlab.

H. Brugman and A. Russel, Annotating multimedia/multimodal resources with ELAN, Proceedings of the International Conference on Language Resources and Evaluation, pp.2065-2068, 2004.

J. Cech, J. Sanchez-Riera, and R. Horaud, Scene flow estimation by growing correspondence seeds, CVPR 2011, 2011.
DOI : 10.1109/CVPR.2011.5995442
URL : https://hal.archives-ouvertes.fr/inria-00590274

E. Cherry, Some Experiments on the Recognition of Speech, with One and with Two Ears, The Journal of the Acoustical Society of America, vol.25, issue.5, pp.975-979, 1953.
DOI : 10.1121/1.1907229

M. Cooke, J. Barker, S. Cunningham, and X. Shao, An audio-visual corpus for speech perception and automatic speech recognition (L), The Journal of the Acoustical Society of America, vol.120, issue.5, pp.2421-2424, 2006.

N. Dalal and B. Triggs, Histograms of Oriented Gradients for Human Detection, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), 2005.
DOI : 10.1109/CVPR.2005.177
URL : https://hal.archives-ouvertes.fr/inria-00548512

L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri, Actions as Space-Time Shapes, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.29, issue.12, pp.2247-2253, 2007.
DOI : 10.1109/TPAMI.2007.70711

M. Hansard and R. Horaud, Cyclopean geometry of binocular vision, Journal of the Optical Society of America A, vol.25, issue.9, pp.2357-2369, 2008.
DOI : 10.1364/JOSAA.25.002357
URL : https://hal.archives-ouvertes.fr/inria-00435548

R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, second edn, Cambridge University Press, ISBN 0521540518, 2004.

S. Haykin and Z. Chen, The Cocktail Party Problem, Neural Computation, vol.17, issue.9, pp.1875-1902, 2005.
DOI : 10.1162/0899766054322964

T. J. Hazen, K. Saenko, C.-H. La, and J. R. Glass, A segment-based audio-visual speech recognizer: data collection, development, and initial experiments, Proceedings of the ACM International Conference on Multimodal Interfaces, ICMI '04, pp.235-242, 2004.
DOI : 10.1145/1027933.1027972

M. Hoai, Z.-Z. Lan, and F. De la Torre, Joint segmentation and classification of human actions in video, CVPR 2011, 2011.
DOI : 10.1109/CVPR.2011.5995470

Z. Kalal, K. Mikolajczyk, and J. Matas, Tracking-Learning-Detection, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.34, issue.7, pp.1409-1422, 2012.
DOI : 10.1109/TPAMI.2011.239

V. Khalidov, F. Forbes, and R. Horaud, Conjugate Mixture Models for Clustering Multimodal Data, Neural Computation, vol.23, issue.2, pp.517-557, 2011.
DOI : 10.1162/NECO_a_00074
URL : https://hal.archives-ouvertes.fr/inria-00590267

H.-D. Kim, J.-S. Choi, and M. Kim, Human-robot interaction in real environments by audio-visual integration, International Journal of Control, Automation and Systems, vol.5, issue.1, pp.61-69, 2007.

I. Laptev, On Space-Time Interest Points, International Journal of Computer Vision, vol.64, issue.2-3, pp.107-123, 2005.
DOI : 10.1007/s11263-005-1838-7
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.58.1419

I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, Learning realistic human actions from movies, 2008 IEEE Conference on Computer Vision and Pattern Recognition, 2008.
DOI : 10.1109/CVPR.2008.4587756
URL : https://hal.archives-ouvertes.fr/inria-00548659

G. Lathoud, J.-M. Odobez, and D. Gatica-Perez, AV16.3: An Audio-Visual Corpus for Speaker Localization and Tracking, Proceedings of the Workshop on Machine Learning and Multimodal Interaction, 2005.
DOI : 10.1007/978-3-540-30568-2_16

J. Liu, J. Luo, and M. Shah, Recognizing realistic actions from videos "in the wild", Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, 2009.

R. C. Luo and M. G. Kay, Multisensor integration and fusion in intelligent systems, IEEE Transactions on Systems, Man, and Cybernetics, vol.19, issue.5, pp.901-931, 1989.
DOI : 10.1109/21.44007

S. Marcel, C. McCool, P. Matejka, T. Ahonen, and J. Cernocky, Mobile biometry (MOBIO) face and speaker verification evaluation, Idiap-RR, Idiap, 2010.
DOI : 10.1007/978-3-642-17711-8_22
URL : https://hal.archives-ouvertes.fr/hal-01318429

M. Marszalek, I. Laptev, and C. Schmid, Actions in context, 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009.
DOI : 10.1109/CVPR.2009.5206557
URL : https://hal.archives-ouvertes.fr/inria-00548645

K. Messer, J. Matas, J. Kittler, and K. Jonsson, XM2VTSDB: The extended M2VTS database, Proceedings of the International Conference on Audio- and Video-Based Biometric Person Authentication, pp.72-77, 1999.

R. Messing, C. Pal, and H. Kautz, Activity recognition using the velocity histories of tracked keypoints, 2009 IEEE 12th International Conference on Computer Vision, 2009.
DOI : 10.1109/ICCV.2009.5459154

Y. Mohammad, Y. Xu, K. Matsumura, and T. Nishida, The H3R explanation corpus human-human and base human-robot interaction dataset, International Conference on Intelligent Sensors, Sensor Networks and Information Processing, pp.201-206, 2008.

E. K. Patterson, S. Gurbuz, Z. Tufekci, and J. N. Gowdy, CUAVE: A new audio-visual database for multimodal human-computer interface research, Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp.2017-2020, 2002.

L. Rybok, S. Friedberger, U. D. Hanebeck, and R. Stiefelhagen, The KIT Robo-kitchen data set for the evaluation of view-based activity recognition systems, 2011 11th IEEE-RAS International Conference on Humanoid Robots, 2011.
DOI : 10.1109/Humanoids.2011.6100854

C. Schüldt, I. Laptev, and B. Caputo, Recognizing human actions: a local SVM approach, Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), pp.32-36, 2004.
DOI : 10.1109/ICPR.2004.1334462

Q. Shi, L. Cheng, L. Wang, and A. Smola, Human Action Segmentation and Recognition Using Discriminative Semi-Markov Models, International Journal of Computer Vision, vol.93, issue.1, pp.22-32, 2011.
DOI : 10.1007/s11263-010-0384-0
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.169.7159

M. Tenorth, J. Bandouch, and M. Beetz, The TUM Kitchen Data Set of everyday manipulation activities for motion tracking and action recognition, 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops, 2009.
DOI : 10.1109/ICCVW.2009.5457583

S. Vedula, S. Baker, P. Rander, R. Collins, and T. Kanade, Three-dimensional scene flow, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.27, issue.3, 2005.
DOI : 10.1109/iccv.1999.790293
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.23.3563

D. Weinland, R. Ronfard, and E. Boyer, Free viewpoint action recognition using motion history volumes, Computer Vision and Image Understanding, vol.104, issue.2-3, pp.249-257, 2006.
DOI : 10.1016/j.cviu.2006.07.013
URL : https://hal.archives-ouvertes.fr/inria-00544629

G. Willems, J. H. Becker, and T. Tuytelaars, Exemplar-based Action Recognition in Video, Proceedings of the British Machine Vision Conference 2009, 2009.
DOI : 10.5244/C.23.90

Z. Zivkovic, O. Booij, B. Kröse, E. A. Topp, and H. I. Christensen, From Sensors to Human Spatial Concepts: An Annotated Data Set, IEEE Transactions on Robotics, vol.24, issue.2, pp.501-505, 2008.
DOI : 10.1109/TRO.2008.918046