X. Alameda-Pineda, V. Khalidov, R. Horaud, and F. Forbes, Finding audio-visual events in informal social gatherings, Proceedings of the 13th international conference on multimodal interfaces, ICMI '11, 2011.
DOI : 10.1145/2070481.2070527

URL : https://hal.archives-ouvertes.fr/inria-00623489

E. Arnaud, H. Christensen, Y. Lu, J. Barker, V. Khalidov et al., The CAVA corpus, Proceedings of the 10th international conference on multimodal interfaces, ICMI '08, pp.109-116, 2008.
DOI : 10.1145/1452392.1452414

URL : https://hal.archives-ouvertes.fr/inria-00373173

E. Bailly-Baillière, S. Bengio, F. Bimbot, M. Hamouz, J. Kittler et al., The BANCA Database and Evaluation Protocol, ICAVBPA, pp.625-638, 2003.
DOI : 10.1007/3-540-44887-X_74

J. Bouguet, Camera calibration toolbox for Matlab, 2008.

J. Cech, J. Sanchez-Riera, and R. P. Horaud, Scene flow estimation by growing correspondence seeds, CVPR 2011, 2011.
DOI : 10.1109/CVPR.2011.5995442

URL : https://hal.archives-ouvertes.fr/inria-00590274

E. C. Cherry, Some Experiments on the Recognition of Speech, with One and with Two Ears, The Journal of the Acoustical Society of America, vol.25, issue.5, pp.975-979, 1953.
DOI : 10.1121/1.1907229

M. Cooke, J. Barker, S. Cunningham, and X. Shao, An audio-visual corpus for speech perception and automatic speech recognition, The Journal of the Acoustical Society of America, vol.120, issue.5, pp.2421-2424, 2006.
DOI : 10.1121/1.2229005

M. Hansard and R. P. Horaud, Cyclopean geometry of binocular vision, Journal of the Optical Society of America A, vol.25, issue.9, pp.2357-2369, 2008.
DOI : 10.1364/JOSAA.25.002357

URL : https://hal.archives-ouvertes.fr/inria-00435548

S. Haykin and Z. Chen, The Cocktail Party Problem, Neural Computation, vol.17, issue.9, pp.1875-1902, 2005.
DOI : 10.1162/0899766054322964

T. J. Hazen, K. Saenko, C. La, and J. R. Glass, A segment-based audiovisual speech recognizer: data collection, development, and initial experiments, ICMI, ICMI '04, pp.235-242, 2004.

G. Lathoud, J.-M. Odobez, and D. Gatica-Perez, AV16.3: An Audio-Visual Corpus for Speaker Localization and Tracking, 2004.
DOI : 10.1007/978-3-540-30568-2_16

J. Liu, J. Luo, and M. Shah, Recognizing realistic actions from videos "in the wild", 2009.

S. Marcel, C. McCool, P. Matejka, T. Ahonen, and J. Cernocky, Mobile biometry (MOBIO) face and speaker verification evaluation, Idiap-RR, Idiap, 2010.
DOI : 10.1007/978-3-642-17711-8_22

URL : https://hal.archives-ouvertes.fr/hal-01318429

K. Messer, J. Matas, J. Kittler, and K. Jonsson, XM2VTSDB: The extended M2VTS database, ICAVBPA, pp.72-77, 1999.

Y. Mohammad, Y. Xu, K. Matsumura, and T. Nishida, The H3R explanation corpus: human-human and base human-robot interaction dataset, ISSNIP, pp.201-206, 2008.

E. K. Patterson, S. Gurbuz, Z. Tufekci, and J. N. Gowdy, CUAVE: A new audiovisual database for multimodal human-computer interface research, ICASSP, pp.2017-2020, 2002.

S. Pigeon, M2VTS database, 1996.

S. Vedula, S. Baker, P. Rander, R. Collins, and T. Kanade, Three-dimensional scene flow, IEEE Trans. on PAMI, vol.27, issue.3, 2005.

D. Weinland, R. Ronfard, and E. Boyer, Free viewpoint action recognition using motion history volumes, Computer Vision and Image Understanding, vol.104, issue.2-3, pp.249-257, 2006.
DOI : 10.1016/j.cviu.2006.07.013

URL : https://hal.archives-ouvertes.fr/inria-00544629

G. Willems, J. H. Becker, and T. Tuytelaars, Exemplar-based Action Recognition in Video, Proceedings of the British Machine Vision Conference 2009, 2009.
DOI : 10.5244/C.23.90

Z. Zivkovic, O. Booij, B. Krose, E. Topp, and H. Christensen, From Sensors to Human Spatial Concepts: An Annotated Data Set, IEEE Transactions on Robotics, vol.24, issue.2, pp.501-505, 2008.
DOI : 10.1109/TRO.2008.918046