I. Gebru, S. Ba, X. Li, and R. Horaud, Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
DOI : 10.1109/TPAMI.2017.2648793

D. T. Dang-Nguyen, I. D. Gebru, V. Conotter, G. Boato, and F. G. B. De Natale, Counter-forensics of median filtering, Multimedia Signal Processing (MMSP), 2013 IEEE 15th International Workshop on, pp.260-265, 2013.

M. Ackerman, S. Ben-David, S. Branzei, and D. Loker, Weighted Clustering, Proceedings of AAAI, 2012.

X. Alameda-Pineda, V. Khalidov, R. Horaud, and F. Forbes, Finding audio-visual events in informal social gatherings, Proceedings of the 13th international conference on multimodal interfaces, ICMI '11, pp.247-254, 2011.
DOI : 10.1145/2070481.2070527
URL : https://hal.archives-ouvertes.fr/inria-00623489

X. Alameda-Pineda, J. Sanchez-Riera, J. Wienke, V. Franc, J. Cech et al., RAVEL: an annotated corpus for training robots with audiovisual abilities, Journal on Multimodal User Interfaces, vol.24, issue.2, pp.79-91, 2013.
DOI : 10.1109/TRO.2008.918046
URL : https://hal.archives-ouvertes.fr/hal-00720734

X. Alameda-Pineda and R. Horaud, Vision-guided robot hearing, The International Journal of Robotics Research, vol.5, issue.1, pp.437-456, 2015.
DOI : 10.1080/01691864.2012.687152
URL : https://hal.archives-ouvertes.fr/hal-00990766

X. Alameda-Pineda, Y. Yan, E. Ricci, O. Lanz, and N. Sebe, Analyzing Free-standing Conversational Groups, Proceedings of the 23rd ACM international conference on Multimedia, MM '15, pp.5-14, 2015.
DOI : 10.1109/TIP.2014.2365699

M. Andersson, S. Ntalampiras, T. Ganchev, J. Rydell, J. Ahlberg et al., Fusion of acoustic and optical sensor data for automatic fight detection in urban environments, 2010 13th International Conference on Information Fusion, pp.1-8, 2010.
DOI : 10.1109/ICIF.2010.5712105

J. L. Andrews and P. D. McNicholas, Model-based clustering, classification, and discriminant analysis via mixtures of multivariate t-distributions, Statistics and Computing, pp.1021-1029, 2012.
DOI : 10.1007/978-1-4757-3121-7

X. A. Miro, S. Bozonnet, N. Evans, C. Fredouille, G. Friedland et al., Speaker Diarization: A Review of Recent Research, IEEE Transactions on Audio, Speech, and Language Processing, vol.20, issue.2, pp.356-370, 2012.
DOI : 10.1109/TASL.2011.2125954

C. Archambeau and M. Verleysen, Robust Bayesian clustering, Neural Networks, vol.20, issue.1, pp.129-138, 2007.
DOI : 10.1016/j.neunet.2006.06.009
URL : http://www.cs.ucl.ac.uk/staff/c.archambeau/publ/nn_ca07_web.pdf

E. Arnaud, H. Christensen, Y. Lu, J. Barker, V. Khalidov et al., The CAVA corpus, Proceedings of the 10th international conference on Multimodal interfaces, ICMI '08, pp.109-116, 2008.
DOI : 10.1145/1452392.1452414
URL : https://hal.archives-ouvertes.fr/inria-00373173

S. Ba, X. Alameda-Pineda, A. Xompero, and R. Horaud, An on-line variational Bayesian model for multi-person tracking from cluttered scenes, Computer Vision and Image Understanding, vol.153, pp.64-76, 2016.
DOI : 10.1016/j.cviu.2016.07.006
URL : https://hal.archives-ouvertes.fr/hal-01349763

F. Badeig, Q. Pelorson, S. Arias, V. Drouard, I. Gebru et al., A Distributed Architecture for Interacting with NAO, Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, ICMI '15, pp.385-386, 2015.
DOI : 10.1145/2818346.2823303
URL : https://hal.archives-ouvertes.fr/hal-01201716

S. Bae and K. Yoon, Robust Online Multi-object Tracking Based on Tracklet Confidence and Online Discriminative Appearance Learning, 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp.1218-1225, 2014.
DOI : 10.1109/CVPR.2014.159

Y. Ban, X. Alameda-Pineda, F. Badeig, S. Ba, and R. Horaud, Tracking a Varying Number of People with a Visually-Controlled Robotic Head, IEEE/RSJ International Conference on Intelligent Robots and Systems, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01542987

Y. Ban, L. Girin, X. Alameda-Pineda, and R. Horaud, Exploiting the Complementarity of Audio and Visual Data in Multi-Speaker Tracking, ICCV Workshop on Computer Vision for Audio-Visual Media, 2017.

J. Banfield and A. Raftery, Model-Based Gaussian and Non-Gaussian Clustering, Biometrics, vol.49, issue.3, pp.803-821, 1993.
DOI : 10.2307/2532201

Z. Barzelay and Y. Schechner, Onsets Coincidence for Cross-Modal Analysis, IEEE Transactions on Multimedia, vol.12, issue.2, pp.108-120, 2010.
DOI : 10.1109/TMM.2009.2037387

J. P. Baudry, A. E. Raftery, G. Celeux, K. Lo, and R. Gottardo, Combining Mixture Components for Clustering, Journal of Computational and Graphical Statistics, vol.19, issue.2, 2010.
DOI : 10.1198/jcgs.2010.08111
URL : https://hal.archives-ouvertes.fr/inria-00321090

M. J. Beal, H. Attias, and N. Jojic, Audio-Video Sensor Fusion with Probabilistic Graphical Models, Computer Vision - ECCV 2002, pp.736-750, 2002.
DOI : 10.1007/3-540-47969-4_49
URL : http://www.gatsby.ucl.ac.uk/~beal/papers/eccv02.ps.gz

K. Bernardin and R. Stiefelhagen, Evaluating Multiple Object Tracking Performance: The CLEAR MOT Metrics, EURASIP Journal on Image and Video Processing, vol.4625, issue.1, pp.1-10, 2008.
DOI : 10.1137/0105003

P. Besson, V. Popovici, J. Vesin, J. Thiran, and M. Kunt, Extraction of audio features specific to speech production for multimodal speaker detection, IEEE Transactions on Multimedia, vol.10, issue.1, pp.63-73, 2008.

C. M. Bishop and M. Svensén, Robust Bayesian mixture modelling, Neurocomputing, vol.64, pp.235-252, 2005.
DOI : 10.1016/j.neucom.2004.11.018

C. M. Bishop and N. M. Nasrabadi, Pattern recognition and machine learning, 2006.

C. Blandin, A. Ozerov, and E. Vincent, Multi-source TDOA estimation in reverberant audio using angular spectra and clustering, Signal Processing, vol.92, issue.8, pp.1950-1960, 2012.
DOI : 10.1016/j.sigpro.2011.09.032
URL : https://hal.archives-ouvertes.fr/inria-00576297

D. Bohus and E. Horvitz, Dialog in the open world, Proceedings of the 2009 international conference on Multimodal interfaces, ICMI-MLMI '09, pp.31-38, 2009.
DOI : 10.1145/1647314.1647323

D. Bohus and E. Horvitz, Facilitating multiparty dialog with gaze, gesture, and speech, International Conference on Multimodal Interfaces and the Workshop on Machine Learning for Multimodal Interaction, ICMI-MLMI '10, p.5, 2010.
DOI : 10.1145/1891903.1891910
URL : http://research.microsoft.com/users/horvitz/ICMI_2010_shaping_turn-taking.pdf

D. Bohus and E. Horvitz, Decisions about turns in multiparty conversation, Proceedings of the 13th international conference on multimodal interfaces, ICMI '11, pp.153-160, 2011.
DOI : 10.1145/2070481.2070507

L. Bourdev and J. Malik, Poselets: Body part detectors trained using 3D human pose annotations, IEEE 12th International Conference on Computer Vision, pp.1365-1372, 2009.

M. S. Brandstein and H. Silverman, A robust method for speech signal time-delay estimation in reverberant rooms, IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997.

L. Breiman, J. Friedman, C. J. Stone, and R. Olshen, Classification and Regression Trees, 1984.

S. Burger, V. Maclaren, and H. Yu, The ISL meeting corpus: The impact of meeting type on speech style, Seventh International Conference on Spoken Language Processing, 2002.

T. Butz and J. Thiran, Feature space mutual information in speech-video sequences, ICME, 2002.

J. Carletta, S. Ashby, S. Bourban, M. Flynn, M. Guillemot et al., The AMI meeting corpus: A pre-announcement, International Workshop on Machine Learning for Multimodal Interaction, pp.28-39, 2005.

G. Celeux, S. Chrétien, F. Forbes, and A. Mkhadri, A componentwise EM algorithm for mixtures, Journal of Computational and Graphical Statistics, vol.10, issue.4, 2001.
DOI : 10.1198/106186001317243403
URL : https://hal.archives-ouvertes.fr/inria-00072916

O. Celiktutan, E. Skordos, and H. Gunes, Multimodal Human-Human-Robot Interactions (MHHRI) Dataset for Studying Personality and Engagement, IEEE Transactions on Affective Computing, 2017.
DOI : 10.1109/TAFFC.2017.2737019

N. Checka, K. Wilson, M. Siracusa, and T. Darrell, Multiple person and speaker activity tracking with a particle filter, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004.
DOI : 10.1109/ICASSP.2004.1327252

M. Cooke, J. Barker, S. Cunningham, and X. Shao, An audio-visual corpus for speech perception and automatic speech recognition, The Journal of the Acoustical Society of America, vol.120, issue.5, pp.2421-2424, 2006.
DOI : 10.1121/1.2229005

R. Cutler and L. Davis, Look who's talking: speaker detection using video and audio correlation, 2000 IEEE International Conference on Multimedia and Expo. ICME2000. Proceedings. Latest Advances in the Fast Changing World of Multimedia (Cat. No.00TH8532), pp.1589-1592, 2000.
DOI : 10.1109/ICME.2000.871073

D. Davies and D. W. Bouldin, A cluster separation measure, IEEE Transactions on Pattern Analysis and Machine Intelligence, issue.2, pp.224-227, 1979.
DOI : 10.1109/tpami.1979.4766909

A. Deleforge, F. Forbes, and R. Horaud, Variational EM for binaural sound-source separation and localization, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp.76-80, 2013.
DOI : 10.1109/ICASSP.2013.6637612
URL : https://hal.archives-ouvertes.fr/hal-00823453

A. Deleforge, V. Drouard, L. Girin, and R. Horaud, Mapping Sounds on Images Using Binaural Spectrograms, Proceedings of the European Signal Processing Conference, 2014.

A. Deleforge, F. Forbes, and R. Horaud, Acoustic Space Learning for Sound-Source Separation and Localization on Binaural Manifolds, International Journal of Neural Systems, vol.7, issue.01, 2014.
DOI : 10.1109/TSA.2005.858005

A. Deleforge, R. Horaud, Y. Y. Schechner, and L. Girin, Co-Localization of Audio Sources in Images Using Binaural Features and Locally-Linear Regression, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.23, issue.4, pp.718-731, 2015.
DOI : 10.1109/TASLP.2015.2405475

I. Dhillon, Y. Guan, and B. Kulis, Kernel k-means, Proceedings of the 2004 ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '04, pp.551-556, 2004.
DOI : 10.1145/1014052.1014118

E. El Khoury, C. Sénac, and P. Joly, Audiovisual diarization of people in video content, Multimedia Tools and Applications, pp.1692-1703, 2014.
DOI : 10.1007/978-3-540-68585-2_49

C. Evers, A. H. Moore, and P. Naylor, Multiple source localisation in the spherical harmonic domain, 2014 14th International Workshop on Acoustic Signal Enhancement (IWAENC), 2014.
DOI : 10.1109/IWAENC.2014.6954298

C. Evers, A. H. Moore, P. A. Naylor, J. Sheaffer, and B. Rafaely, Bearing-only acoustic tracking of moving speakers for robot audition, 2015 IEEE International Conference on Digital Signal Processing (DSP), 2015.
DOI : 10.1109/ICDSP.2015.7252071

D. Feldman and L. Schulman, Data reduction for weighted and outlier-resistant clustering, Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, pp.1343-1354, 2012.
DOI : 10.1137/1.9781611973099.106
URL : http://epubs.siam.org/doi/pdf/10.1137/1.9781611973099.106

V. Ferrari, M. Marin-Jimenez, and A. Zisserman, Progressive search space reduction for human pose estimation, 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp.1-8, 2008.
DOI : 10.1109/CVPR.2008.4587468
URL : http://eprints.pascal-network.org/archive/00004745/01/cvpr.pdf

M. A. Figueiredo and A. Jain, Unsupervised learning of finite mixture models, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.24, issue.3, pp.381-396, 2002.
DOI : 10.1109/34.990138

J. G. Fiscus, N. Radde, J. S. Garofolo, A. Le, J. Ajot et al., The Rich Transcription 2005 Spring Meeting Recognition Evaluation, International Workshop on Machine Learning for Multimodal Interaction, pp.369-389, 2005.
DOI : 10.1007/11677482_32
URL : http://www.itl.nist.gov/iad/mig/publications/storage_paper/RT06SResults-v07.pdf

J. G. Fiscus, J. Ajot, M. Michel, and J. Garofolo, The Rich Transcription 2006 Spring Meeting Recognition Evaluation, International Workshop on Machine Learning for Multimodal Interaction, pp.309-322, 2006.
DOI : 10.1007/11965152_28

J. Fisher III, T. Darrell, W. T. Freeman et al., Learning joint statistical models for audio-visual fusion and segregation, NIPS, pp.772-778, 2000.

J. Fisher and T. Darrell, Speaker Association With Signal-Level Audiovisual Fusion, IEEE Transactions on Multimedia, vol.6, issue.3, pp.406-413, 2004.
DOI : 10.1109/TMM.2004.827503
URL : http://people.csail.mit.edu/~fisher/publications/papers/fisher04tmm.pdf

F. Forbes, S. Doyle, D. Garcia-Lorenzo, C. Barillot, and M. Dojat, A weighted multi-sequence Markov model for brain lesion segmentation, JMLR Workshop and Conference Proceedings, pp.225-232, 2010.

F. Forbes and D. Wraith, A new family of multivariate heavy-tailed distributions with variable marginal amounts of tailweight: application to robust clustering, Statistics and Computing, vol.94, issue.1, pp.971-984, 2014.
DOI : 10.1016/S0378-3758(00)00208-1

P. Frey and D. J. Slate, Letter recognition using Holland-style adaptive classifiers, Machine Learning, pp.161-182, 1991.
DOI : 10.1007/BF00114162

G. Garau, A. Dielmann, and H. Bourlard, Audio-visual synchronisation for speaker diarisation, INTERSPEECH, pp.2654-2657, 2010.

J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. Pallett, DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM, NASA STI/Recon technical report n, 1993.

J. S. Garofolo, C. Laprun, M. Michel, V. M. Stanford, and E. Tabassi, The NIST Meeting Room Pilot Corpus, LREC, 2004.

D. Gatica-Perez, G. Lathoud, J. Odobez, and I. McCowan, Audiovisual Probabilistic Tracking of Multiple Speakers in Meetings, IEEE Transactions on Audio, Speech and Language Processing, vol.15, issue.2, pp.601-616, 2007.
DOI : 10.1109/TASL.2006.881678

D. Gatica-Perez, Automatic nonverbal analysis of social interaction in small groups: A review, Image and Vision Computing, vol.27, issue.12, pp.1775-1787, 2009.
DOI : 10.1016/j.imavis.2009.01.004

I. D. Gebru, X. Alameda-Pineda, R. Horaud, and F. Forbes, Audiovisual speaker localization via weighted clustering, Machine Learning for Signal Processing (MLSP), 2014 IEEE International Workshop on, pp.1-6, 2014.
DOI : 10.1109/mlsp.2014.6958874
URL : https://hal.archives-ouvertes.fr/hal-01053732

I. Gebru, S. Ba, G. Evangelidis, and R. Horaud, Audio-Visual Speech-Turn Detection and Tracking, The Twelfth International Conference on Latent Variable Analysis and Signal Separation, 2015.
DOI : 10.1007/978-3-319-22482-4_17

I. Gebru, S. Ba, G. Evangelidis, and R. Horaud, Tracking the Active Speaker Based on a Joint Audio-Visual Observation Model, 2015 IEEE International Conference on Computer Vision Workshop (ICCVW), pp.15-21, 2015.
DOI : 10.1109/ICCVW.2015.96

I. Gebru, X. Alameda-Pineda, F. Forbes, and R. Horaud, EM Algorithms for Weighted-Data Clustering with Application to Audio-Visual Scene Analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.38, issue.12, pp.2402-2415, 2016.
DOI : 10.1109/TPAMI.2016.2522425

I. Gebru, C. Evers, P. A. Naylor, and R. Horaud, Audio-visual tracking by density approximation in a sequential Bayesian filtering framework, 2017 Hands-free Speech Communications and Microphone Arrays (HSCMA), pp.71-75, 2017.
DOI : 10.1109/HSCMA.2017.7895564

D. Görür and C. Rasmussen, Dirichlet Process Gaussian Mixture Models: Choice of the Base Distribution, Journal of Computer Science and Technology, vol.25, issue.3, pp.653-664, 2010.
DOI : 10.1007/s11390-010-9355-8

M. Gurban and J. Thiran, Multimodal speaker localization in a probabilistic framework, 14th European Signal Processing Conference, pp.1-5, 2006.

S. S. Haykin, Kalman filtering and neural networks, 2001.
DOI : 10.1002/0471221546

T. Hazen, K. Saenko, C. La, and J. Glass, A segment-based audio-visual speech recognizer: Data collection, development, and initial experiments, Proceedings of the 6th international conference on Multimodal interfaces, pp.235-242, 2004.
DOI : 10.1145/1027933.1027972

J. Hershey and J. Movellan, Audio-vision: Using audio-visual synchrony to locate sounds, Advances in Neural Information Processing Systems, pp.813-819, 2000.

J. R. Hoffman and R. P. Mahler, Multitarget Miss Distance via Optimal Assignment, IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, vol.34, issue.3, pp.327-336, 2004.
DOI : 10.1109/TSMCA.2004.824848

L. Itti, C. Koch, and E. Niebur, A model of saliency-based visual attention for rapid scene analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.20, issue.11, pp.1254-1259, 1998.
DOI : 10.1109/34.730558

A. Janin, D. Baron, J. Edwards, D. Ellis et al., The ICSI meeting corpus, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'03), 2003.

D. Jayagopi, S. Sheiki, D. Klotz, J. Wienke, J. Odobez et al., The vernissage corpus: A conversational human-robot interaction dataset, Proceedings of the 8th ACM/IEEE international conference on Human-robot interaction, pp.149-150, 2013.

M. Johansson, G. Skantze, and J. Gustafson, Comparison of Human-Human and Human-Robot Turn-Taking Behaviour in Multiparty Situated Interaction, Proceedings of the 2014 workshop on Understanding and Modeling Multiparty, Multimodal Interactions, UM3I '14, pp.21-26, 2014.
DOI : 10.1142/S0219843613500059

S. Julier and J. Uhlmann, A general method for approximating nonlinear transformations of probability distributions, Technical report, Robotics Research Group, 1996.

S. Julier and J. Uhlmann, New extension of the Kalman filter to nonlinear systems, Signal Processing, Sensor Fusion, and Target Recognition VI, pp.182-193, 1997.
DOI : 10.1117/12.280797

M. Kächele, S. Meudt, A. Schwarz, and F. Schwenker, Audio-Visual User Identification in HCI Scenarios, IAPR Workshop on Multimodal Pattern Recognition of Social Signals in Human-Computer Interaction, pp.113-122, 2014.
DOI : 10.1007/978-3-319-14899-1_11

V. Khalidov, F. Forbes, M. Hansard, E. Arnaud, and R. Horaud, Audio-Visual Clustering for 3D Speaker Localization, International Workshop on Machine Learning for Multimodal Interaction, pp.86-97, 2008.
DOI : 10.1007/978-3-540-85853-9_8

V. Khalidov, F. Forbes, M. Hansard, E. Arnaud, and R. Horaud, Detection and localization of 3D audio-visual objects using unsupervised clustering, Proceedings of the 10th international conference on Multimodal interfaces, ICMI '08, pp.217-224, 2008.
DOI : 10.1145/1452392.1452438

V. Khalidov, F. Forbes, and R. Horaud, Conjugate Mixture Models for Clustering Multimodal Data, Neural Computation, vol.49, issue.3, pp.517-557, 2011.
DOI : 10.1007/978-94-011-3436-1

V. Khalidov, F. Forbes, and R. Horaud, Alignment of binocular-binaural data using a moving audio-visual target, 2013 IEEE 15th International Workshop on Multimedia Signal Processing (MMSP), 2013.
DOI : 10.1109/MMSP.2013.6659295
URL : https://hal.archives-ouvertes.fr/hal-00861482

E. Kidron, Y. Y. Schechner, and M. Elad, Pixels that Sound, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), pp.88-95, 2005.
DOI : 10.1109/CVPR.2005.274

E. Kidron, Y. Schechner, and M. Elad, Cross-Modal Localization via Sparsity, IEEE Transactions on Signal Processing, vol.55, issue.4, pp.1390-1404, 2007.
DOI : 10.1109/TSP.2006.888095

V. Kilic, M. Barnard, W. Wang, and J. Kittler, Audio Assisted Robust Visual Tracking With Adaptive Particle Filtering, IEEE Transactions on Multimedia, vol.17, issue.2, pp.186-200, 2015.
DOI : 10.1109/TMM.2014.2377515

S. Kotz and S. Nadarajah, Multivariate t Distributions and their Applications, 2004.

H. W. Kuhn, The Hungarian method for the assignment problem, Naval Research Logistics Quarterly, vol.3, issue.1-2, pp.83-97, 1955.
DOI : 10.2140/pjm.1953.3.369

S. Lang, M. Kleinehagenbrock, S. Hohenner, J. Fritsch, G. Fink, and G. Sagerer, Providing the basis for human-robot interaction: A multi-modal attention system for a mobile robot, International conference on Multimodal interfaces, 2003.

G. Lathoud, J. Odobez, and D. Gatica-Perez, AV16.3: an audio-visual corpus for speaker localization and tracking, Machine Learning for Multimodal Interaction, pp.182-195, 2004.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, pp.2278-2324, 1998.
DOI : 10.1109/5.726791
URL : http://www.cs.berkeley.edu/~daf/appsem/Handwriting/papers/00726791.pdf

S. Lee and G. J. McLachlan, Finite mixtures of multivariate skew t-distributions: some recent and new results, Statistics and Computing, vol.82, issue.4, pp.181-202, 2014.
DOI : 10.1109/DICTA.2009.88

X. Li, L. Girin, R. Horaud, and S. Gannot, Estimation of relative transfer function in the presence of stationary noise based on segmental power spectral density matrix subtraction, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.
DOI : 10.1109/ICASSP.2015.7177983

X. Li, R. Horaud, L. Girin, and S. Gannot, Local Relative Transfer Function for Sound Source Localization, European Signal Processing Conference, 2015.

X. Li, L. Girin, S. Gannot, and R. Horaud, Non-stationary noise power spectral density estimation based on regional statistics, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.181-185, 2016.
DOI : 10.1109/ICASSP.2016.7471661
URL : https://hal.archives-ouvertes.fr/hal-01250892

B. Long, Z. M. Zhang, X. Wu, and P. S. Yu, Spectral clustering for multi-type relational data, Proceedings of the 23rd International Conference on Machine learning, pp.585-592, 2006.

M. I. Mandel, R. J. Weiss, and D. P. Ellis, Model-Based Expectation-Maximization Source Separation and Localization, IEEE Transactions on Audio, Speech, and Language Processing, vol.18, issue.2, pp.382-394, 2010.
DOI : 10.1109/TASL.2009.2029711
URL : http://www.ee.columbia.edu/%7Eronw/pubs/taslp09-messl.pdf

C. McCool, S. Marcel, A. Hadid, M. Pietikäinen, P. Matejka et al., Bi-Modal Person Recognition on a Mobile Phone: Using Mobile Phone Data, 2012 IEEE International Conference on Multimedia and Expo Workshops, pp.635-640, 2012.
DOI : 10.1109/ICMEW.2012.116

G. McLachlan and D. Peel, Finite Mixture Models, 2000.
DOI : 10.1002/0471721182

G. McLachlan and D. Peel, Robust Mixture Modelling Using the t Distribution, Statistics and Computing, vol.10, issue.4, pp.339-348, 2000.

V. P. Minotto, C. R. Jung, and B. Lee, Multimodal Multi-Channel On-Line Speaker Diarization Using Sensor Fusion Through SVM, IEEE Transactions on Multimedia, pp.1694-1705, 2015.
DOI : 10.1109/TMM.2015.2463722

S. Naqvi, M. Yu, and J. Chambers, A Multimodal Approach to Blind Source Separation of Moving Sources, IEEE Journal of Selected Topics in Signal Processing, vol.4, issue.5, pp.895-910, 2010.
DOI : 10.1109/JSTSP.2010.2057198

H. Nock, G. Iyengar, and C. Neti, Speaker Localisation Using Audio-Visual Synchrony: An Empirical Study, International conference on image and video retrieval, pp.488-499, 2003.
DOI : 10.1007/3-540-45113-7_48
URL : http://www.research.ibm.com/AVSTG/Speaker_Localisation_CIVR_2003.pdf

A. Noulas and B. Krose, On-line multi-modal speaker diarization, Proceedings of the ninth international conference on Multimodal interfaces, ICMI '07, pp.350-357, 2007.
DOI : 10.1145/1322192.1322254

A. Noulas, G. Englebienne, and B. Krose, Multimodal Speaker Diarization, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.34, issue.1, pp.79-93, 2012.
DOI : 10.1109/TPAMI.2011.47

K. Otsuka, H. Sawada, and J. Yamato, Automatic inference of cross-modal nonverbal interactions in multiparty conversations: who responds to whom, when, and how? from gaze, head gestures, and utterances, Proceedings of the 9th international conference on Multimodal interfaces, pp.255-262, 2007.

E. K. Patterson, S. Gurbuz, Z. Tufekci, and J. Gowdy, CUAVE: A new audio-visual database for multimodal human-computer interface research, Acoustics, Speech, and Signal Processing (ICASSP) IEEE International Conference on, p.2017, 2002.

T. Poggio and F. Girosi, A theory of networks for approximation and learning, Technical report, DTIC Document, 1989.

G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. W. Senior, Recent advances in the automatic recognition of audiovisual speech, Proceedings of the IEEE, pp.1306-1326, 2003.

C. Rasmussen, The infinite Gaussian mixture model, NIPS, pp.554-560, 1999.

B. Rivet, L. Girin, and C. Jutten, Mixing Audiovisual Speech Processing and Blind Source Separation for the Extraction of Speech Signals From Convolutive Mixtures, IEEE Transactions on Audio, Speech and Language Processing, vol.15, issue.1, pp.96-108, 2007.
DOI : 10.1109/TASL.2006.872619

J. Sanchez-Riera, X. Alameda-Pineda, J. Wienke, A. Deleforge et al., Online multimodal speaker detection for humanoid robots, 2012 12th IEEE-RAS International Conference on Humanoid Robots (Humanoids 2012), 2012.
DOI : 10.1109/HUMANOIDS.2012.6651509

N. Sarafianos, T. Giannakopoulos, and S. Petridis, Audio-visual speaker diarization using Fisher linear semi-discriminant analysis, Multimedia Tools and Applications, pp.115-130, 2016.
DOI : 10.1007/BF01210504

M. E. Sargin, Y. Yemez, E. Erzin, and A. M. Tekalp, Audiovisual Synchronization and Fusion Using Canonical Correlation Analysis, IEEE Transactions on Multimedia, vol.9, issue.7, pp.1396-1403, 2007.
DOI : 10.1109/TMM.2007.906583
URL : http://network.ku.edu.tr/~yyemez/ieeetransmultimedia07.pdf

D. Schuhmacher, B.-T. Vo, and B.-N. Vo, A Consistent Metric for Performance Evaluation of Multi-Object Filters, IEEE Transactions on Signal Processing, vol.56, issue.8, pp.3447-3457, 2008.
DOI : 10.1109/TSP.2008.920469

S. J. Sheather and M. Jones, A reliable data-based bandwidth selection method for kernel density estimation, Journal of the Royal Statistical Society. Series B (Methodological), pp.683-690, 1991.

J. Shi and J. Malik, Normalized cuts and image segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.22, issue.8, pp.888-905, 2000.

S. T. Shivappa, M. M. Trivedi, and B. D. Rao, Audiovisual Information Fusion in Human-Computer Interfaces and Intelligent Environments: A Survey, Proceedings of the IEEE, pp.1692-1715, 2010.
DOI : 10.1109/JPROC.2010.2057231

M. R. Siracusa and J. Fisher, Dynamic dependency tests for audio-visual speaker association, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'07), 2007.
DOI : 10.1109/icassp.2007.366271
URL : http://people.csail.mit.edu/fisher/publications/papers/siracusa07icassp.pdf

G. Skantze, A. Hjalmarsson, and C. Oertel, Turn-taking, feedback and joint attention in situated human-robot interaction, 2014.

J. Sohn, N. S. Kim, and W. Sung, A statistical model-based voice activity detection, IEEE Signal Processing Letters, vol.6, issue.1, pp.1-3, 1999.
DOI : 10.1109/97.736233

R. Stiefelhagen, K. Bernardin, R. Bowers, J. Garofolo, D. Mostefa et al., The CLEAR 2006 Evaluation, International Evaluation Workshop on Classification of Events, Activities and Relationships, pp.1-44, 2006.
DOI : 10.1007/978-3-540-69568-4_1

R. Stiefelhagen, R. Bowers, and J. Fiscus, Multimodal Technologies for Perception of Humans: International Evaluation Workshops CLEAR 2007 and RT, 2007.

W. Street, W. Wolberg, and O. Mangasarian, Nuclear feature extraction for breast tumor diagnosis, IS&T/SPIE's Symposium on Electronic Imaging: Science and Technology, pp.861-870, 1993.

J. Sun, A. Kabán et al., Robust mixture clustering using Pearson type VII distribution.

F. Talantzis, A. Pnevmatikakis, and A. Constantinides, Audio-Visual Active Speaker Tracking in Cluttered Indoors Environments, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol.38, issue.3, pp.799-807, 2008.
DOI : 10.1109/TSMCB.2008.922063

R. van der Merwe, A. Doucet, N. de Freitas, and E. Wan, The unscented particle filter, Advances In Neural Information Processing Systems, pp.584-590, 2000.

B. D. Van Veen and K. Buckley, Beamforming: a versatile approach to spatial filtering, IEEE ASSP Magazine, vol.5, issue.2, pp.4-24, 1988.
DOI : 10.1109/53.665

D. Vijayasenan and F. Valente, DiarTk: An Open Source Toolkit for Research in Multistream Speaker Diarization and its Application to Meetings Recordings, INTERSPEECH, pp.2170-2173, 2012.

A. Vinciarelli, M. Pantic, D. Heylen, C. Pelachaud et al., Bridging the Gap between Social Animal and Unsocial Machine: A Survey of Social Signal Processing, IEEE Transactions on Affective Computing, vol.3, issue.1, pp.69-87, 2012.
DOI : 10.1109/T-AFFC.2011.27

P. Viola and M. Jones, Robust real-time face detection, International Journal of Computer Vision, vol.57, issue.2, pp.137-154, 2004.

X. Wei and C. Li, The infinite Student's t-mixture for robust modeling, Signal Processing, vol.92, issue.1, pp.224-234, 2012.
DOI : 10.1016/j.sigpro.2011.07.010

W. Xi, B. Zhang, Z. Chen, Y. Lu, S. Yan et al., Link fusion, Proceedings of the 13th conference on World Wide Web, WWW '04, pp.319-327, 2004.
DOI : 10.1145/988672.988715

Y. Yan, E. Ricci, R. Subramanian, O. Lanz, and N. Sebe, No Matter Where You Are: Flexible Graph-Guided Multi-task Learning for Multi-view Head Pose Classification under Target Motion, 2013 IEEE International Conference on Computer Vision, pp.1177-1184, 2013.
DOI : 10.1109/ICCV.2013.150
URL : http://vintage.winklerbros.net/Publications/iccv2013.pdf

H. Yehia, P. Rubin, and E. Vatikiotis-Bateson, Quantitative association of vocal-tract and facial behavior, Speech Communication, vol.26, issue.1-2, pp.23-43, 1998.
DOI : 10.1016/S0167-6393(98)00048-X

H. Yerebakan, B. Rajwa, and M. Dundar, The Infinite Mixture of Infinite Gaussian Mixtures, Advances in Neural Information Processing Systems, pp.28-36, 2014.

M. Zancanaro, B. Lepri, and F. Pianesi, Automatic detection of group functional roles in face to face interactions, Proceedings of the 8th international conference on Multimodal interfaces, ICMI '06, pp.28-34, 2006.
DOI : 10.1145/1180995.1181003

Y. Zhao and G. Karypis, Evaluation of hierarchical clustering algorithms for document datasets, Proceedings of the Eleventh International Conference on Information and Knowledge Management, pp.515-524, 2002.

X. Zhao, N. Evans, and J. Dugelay, CO-LDA: A Semi-supervised Approach to Audio-Visual Person Recognition, 2012 IEEE International Conference on Multimedia and Expo, pp.356-361, 2012.
DOI : 10.1109/ICME.2012.14

H. Zhou, M. Taj, and A. Cavallaro, Target detection and tracking with heterogeneous sensors, IEEE Journal of Selected Topics in Signal Processing, vol.2, issue.4, pp.503-513, 2008.

X. Zhu and D. Ramanan, Face detection, pose estimation, and landmark localization in the wild, Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp.2879-2886, 2012.

D. Zotkin, R. Duraiswami, and L. Davis, Joint Audio-Visual Tracking Using Particle Filters, EURASIP Journal on Advances in Signal Processing, vol.2002, issue.11, pp.1154-1164, 2002.
DOI : 10.1155/S1110865702206058
URL : https://doi.org/10.1155/s1110865702206058