X. Alameda-Pineda, V. Khalidov, R. P. Horaud, and F. Forbes, Finding audio-visual events in informal social gatherings, Proceedings of the 13th international conference on multimodal interfaces, ICMI '11, pp.247-254, 2011.
DOI : 10.1145/2070481.2070527

URL : https://hal.archives-ouvertes.fr/inria-00623489

A. Brooks, Coordinating human-robot communication, 2007.

Y. Fu, R. Li, T. Huang, and M. Danielsen, Real-time multimodal human-avatar interaction, IEEE Transactions on Circuits and Systems for Video Technology, vol.18, issue.4, pp.467-477, 2008.

W. Feng, L. Xie, J. Zeng, and Z.-Q. Liu, Audio-visual human recognition using semi-supervised spectral learning and hidden Markov models, Journal of Visual Languages & Computing, vol.20, issue.3, pp.188-195, 2009.
DOI : 10.1016/j.jvlc.2009.01.009

S. Petridis and M. Pantic, Audiovisual Discrimination Between Speech and Laughter: Why and When Visual Information Might Help, IEEE Transactions on Multimedia, vol.13, issue.2, pp.216-234, 2011.
DOI : 10.1109/TMM.2010.2101586

M. Cristani, M. Bicego, and V. Murino, Audio-Visual Event Recognition in Surveillance Video Sequences, IEEE Transactions on Multimedia, vol.9, issue.2, pp.257-267, 2007.
DOI : 10.1109/TMM.2006.886263

T. Kuhnapfel, T. Tan, S. Venkatesh, and E. Lehmann, Calibration of Audio-Video Sensors for Multi-Modal Event Indexing, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '07, pp.741-744, 2007.
DOI : 10.1109/ICASSP.2007.366342

Z. Barzelay and Y. Schechner, Onsets Coincidence for Cross-Modal Analysis, IEEE Transactions on Multimedia, vol.12, issue.2, pp.108-120, 2010.
DOI : 10.1109/TMM.2009.2037387

A. Llagostera-Casanovas, G. Monaci, P. Vandergheynst, and R. Gribonval, Blind Audiovisual Source Separation Based on Sparse Redundant Representations, IEEE Transactions on Multimedia, vol.12, issue.5, pp.358-371, 2010.
DOI : 10.1109/TMM.2010.2050650

URL : https://hal.archives-ouvertes.fr/inria-00541412

D. Gatica-Perez, G. Lathoud, J. Odobez, and I. McCowan, Audiovisual Probabilistic Tracking of Multiple Speakers in Meetings, IEEE Transactions on Audio, Speech and Language Processing, vol.15, issue.2, pp.601-616, 2007.
DOI : 10.1109/TASL.2006.881678

J. Vermaak, M. Gangnet, A. Blake, and P. Pérez, Sequential Monte Carlo fusion of sound and vision for speaker tracking, Proceedings Eighth IEEE International Conference on Computer Vision, ICCV 2001, pp.741-746, 2001.
DOI : 10.1109/ICCV.2001.937600

P. Pérez, J. Vermaak, and A. Blake, Data Fusion for Visual Tracking With Particles, Proceedings of the IEEE, pp.495-513, 2004.
DOI : 10.1109/JPROC.2003.823147

V. Khalidov, F. Forbes, M. Hansard, E. Arnaud, and R. Horaud, Detection and localization of 3D audio-visual objects using unsupervised clustering, Proc. of ICMI, 2008.

S. Shivappa, M. Trivedi, and B. Rao, Audiovisual Information Fusion in Human-Computer Interfaces and Intelligent Environments: A Survey, Proceedings of the IEEE, pp.1692-1715, 2010.
DOI : 10.1109/JPROC.2010.2057231

B. Stein and T. Stanford, Multisensory integration: current issues from the perspective of the single neuron, Nature Reviews Neuroscience, vol.9, issue.4, pp.255-266, 2008.
DOI : 10.1038/nrn2331

A. J. King, Visual influences on auditory spatial learning, Philosophical Transactions of the Royal Society B: Biological Sciences, vol.364, issue.1515, pp.331-339, 2009.
DOI : 10.1098/rstb.2008.0230

M. Beal, N. Jojic, H. Attias, K. Wilson, M. Siracusa et al., A graphical model for audiovisual object tracking, Proc. of IEEE Conference on Acoustics, Speech, and Signal Processing, pp.828-836, 2003.
DOI : 10.1109/TPAMI.2003.1206512

T. Hospedales and S. Vijayakumar, Structure Inference for Bayesian Multisensory Scene Understanding, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.30, issue.12, pp.2140-2157, 2008.
DOI : 10.1109/TPAMI.2008.25

V. Khalidov, F. Forbes, and R. Horaud, Conjugate Mixture Models for Clustering Multimodal Data, Neural Computation, vol.23, issue.2, pp.517-557, 2011.
DOI : 10.1162/NECO_a_00074

URL : https://hal.archives-ouvertes.fr/inria-00590267

K. Nickel, T. Gehrig, R. Stiefelhagen, and J. McDonough, A joint particle filter for audio-visual speaker tracking, Proceedings of the 7th international conference on Multimodal interfaces, ICMI '05, pp.61-68, 2005.
DOI : 10.1145/1088463.1088477

D. N. Zotkin, R. Duraiswami, and L. S. Davis, Joint Audio-Visual Tracking Using Particle Filters, EURASIP Journal on Advances in Signal Processing, vol.2002, issue.11, pp.1154-1164, 2002.
DOI : 10.1155/S1110865702206058

V. Willert, J. Eggert, J. Adamy, R. Stahl, and E. Koerner, A Probabilistic Model for Binaural Sound Localization, IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics), vol.36, issue.5, pp.982-994, 2006.
DOI : 10.1109/TSMCB.2006.872263

V. Raykar and R. Duraiswami, Automatic position calibration of multiple microphones, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, pp.69-72, 2004.
DOI : 10.1109/ICASSP.2004.1326765

S. Birchfield and A. Subramanya, Microphone array position calibration by basis-point classical multidimensional scaling, IEEE Transactions on Speech and Audio Processing, vol.13, issue.5, pp.1025-1034, 2005.
DOI : 10.1109/TSA.2005.851893

S. Thrun, Affine structure from sound, Proceedings of Conference on Neural Information Processing Systems (NIPS), 2005.

M. Pollefeys and D. Nister, Direct computation of sound and microphone locations from time-difference-of-arrival data, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, pp.2445-2448, 2008.
DOI : 10.1109/ICASSP.2008.4518142

P. Aarabi, Self-localizing dynamic microphone arrays, IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications and Reviews), vol.32, issue.4, pp.474-484, 2002.
DOI : 10.1109/TSMCB.2002.804369

J. Chen, R. Hudson, and K. Yao, Maximum-likelihood source localization and unknown sensor location estimation for wideband signals in the near-field, IEEE Transactions on Signal Processing, vol.50, issue.8, pp.1843-1854, 2002.
DOI : 10.1109/TSP.2002.800420

A. O'Donovan, R. Duraiswami, and J. Neumann, Microphone Arrays as Generalized Cameras for Integrated Audio Visual Processing, 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp.1-8, 2007.
DOI : 10.1109/CVPR.2007.383345

R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2003.
DOI : 10.1017/CBO9780511811685

E. Ettinger and Y. Freund, Coordinate-free calibration of an acoustically driven camera pointing system, 2008 Second ACM/IEEE International Conference on Distributed Smart Cameras, pp.1-9, 2008.
DOI : 10.1109/ICDSC.2008.4635685

Z. Barzelay and Y. Schechner, Harmony in Motion, 2007 IEEE Conference on Computer Vision and Pattern Recognition, 2007.
DOI : 10.1109/CVPR.2007.383344

A. Deleforge and R. P. Horaud, The cocktail party robot, Proceedings of the seventh annual ACM/IEEE international conference on Human-Robot Interaction, HRI '12, 2012.
DOI : 10.1145/2157689.2157834

URL : https://hal.archives-ouvertes.fr/hal-00768668

J. Spall, Introduction to Stochastic Search and Optimization: Estimation, Simulation and Control, 2003.

G. McLachlan and T. Krishnan, The EM Algorithm and Extensions, 2007.

H. Christensen, N. Ma, S. Wrigley, and J. Barker, Integrating pitch and localisation cues at a speech fragment level, Proc. of Interspeech, pp.2769-2772, 2007.