X. Alameda-Pineda, V. Khalidov, R. Horaud, and F. Forbes, Finding audio-visual events in informal social gatherings, Proceedings of the 13th international conference on multimodal interfaces, ICMI '11, 2011.
DOI : 10.1145/2070481.2070527
URL : https://hal.archives-ouvertes.fr/inria-00623489

M. J. Beal, H. Attias, and N. Jojic, Audio-visual sensor fusion with probabilistic graphical models, ECCV, 2002.

Y. Chan, W. Tsui, H. So, and P. Ching, Time-of-arrival based localization under NLOS conditions, IEEE Transactions on Vehicular Technology, vol.55, issue.1, pp.17-24, 2006.

J. W. Fisher and T. Darrell, Speaker Association With Signal-Level Audiovisual Fusion, IEEE Transactions on Multimedia, 2004.
DOI : 10.1109/TMM.2004.827503

M. Hansard and R. Horaud, Cyclopean geometry of binocular vision, Journal of the Optical Society of America A, vol.25, issue.9, pp.2357-2369, 2008.
DOI : 10.1364/JOSAA.25.002357
URL : https://hal.archives-ouvertes.fr/inria-00435548

T. Itohara, T. Otsuka, T. Mizumoto, T. Ogata, and H. G. Okuno, Particle-filter based audio-visual beat-tracking for music robot ensemble with human guitarist, 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2011.
DOI : 10.1109/IROS.2011.6094773

V. Khalidov, F. Forbes, and R. Horaud, Conjugate Mixture Models for Clustering Multimodal Data, Neural Computation, pp.517-557, 2011.
DOI : 10.1007/978-94-011-3436-1
URL : https://hal.archives-ouvertes.fr/inria-00590267

V. Khalidov, F. Forbes, and R. P. Horaud, Calibration of a binocular-binaural sensor using a moving audio-visual target, INRIA Grenoble Rhône-Alpes, 2012.

L. Lacheze, Y. Guo, R. Benosman, B. Gas, and C. Couverture, Audio/video fusion for objects recognition, 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2009.
DOI : 10.1109/IROS.2009.5354442

N. A. Lili, A framework for human action detection via extraction of multimodal features, International Journal of Image Processing, vol.3, issue.2, 2009.

R. C. Luo and M. G. Kay, Multisensor integration and fusion in intelligent systems, IEEE Transactions on Systems, Man, and Cybernetics, pp.901-931, 1989.
DOI : 10.1109/21.44007

K. Nakadai, D. Matsuura, H. G. Okuno, and H. Tsujino, Improvement of recognition of simultaneous speech signals using AV integration and scattering theory for humanoid robots, Speech Communication, vol.44, issue.1-4, pp.97-112, 2004.
DOI : 10.1016/j.specom.2004.10.010

S. T. Shivappa, B. D. Rao, and M. M. Trivedi, Audio-visual fusion and tracking with multilevel iterative decoding: Framework and experimental evaluation, IEEE Journal of Selected Topics in Signal Processing, 2010.

J. Sochman and J. Matas, WaldBoost - learning for time constrained sequential detection, CVPR, 2005.

J. Wienke and S. Wrede, A middleware for collaborative research in experimental robotics, 2011 IEEE/SICE International Symposium on System Integration (SII), 2011.
DOI : 10.1109/SII.2011.6147617

Q. Wu, Z. Wang, F. Deng, and D. Feng, Realistic Human Action Recognition with Audio Context, 2010 International Conference on Digital Image Computing: Techniques and Applications, 2010.
DOI : 10.1109/DICTA.2010.57

J.-Y. Bouguet, Camera calibration toolbox for Matlab

C. Zhang, P. Yin, Y. Rui, R. Cutler, P. Viola et al., Boosting-Based Multimodal Speaker Detection for Distributed Meeting Videos, IEEE Transactions on Multimedia, 2008.
DOI : 10.1109/TMM.2008.2007344