D. Gatica-Perez, G. Lathoud, J. Odobez, and I. McCowan, Audiovisual probabilistic tracking of multiple speakers in meetings, IEEE Transactions on Audio, Speech, and Language Processing, vol.15, issue.2, pp.601-616, 2007.

T. Hospedales and S. Vijayakumar, Structure inference for Bayesian multisensory scene understanding, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.30, issue.12, pp.2140-2157, 2008.

S. Naqvi, M. Yu, and J. Chambers, A multimodal approach to blind source separation of moving sources, IEEE Journal of Selected Topics in Signal Processing, vol.4, issue.5, pp.895-910, 2010.

V. Kılıç, M. Barnard, W. Wang, and J. Kittler, Audio assisted robust visual tracking with adaptive particle filtering, IEEE Transactions on Multimedia, vol.17, issue.2, pp.186-200, 2015.

N. Schult, T. Reineking, T. Kluss, and C. Zetzsche, Information-driven active audio-visual source localization, PLoS ONE, vol.10, issue.9, 2015.

V. Kılıç, M. Barnard, W. Wang, A. Hilton, and J. Kittler, Mean-shift and sparse sampling-based SMC-PHD filtering for audio informed visual speaker tracking, IEEE Transactions on Multimedia, vol.18, issue.12, pp.2417-2431, 2016.

S. Ba, X. Alameda-Pineda, A. Xompero, and R. Horaud, An on-line variational Bayesian model for multi-person tracking from cluttered scenes, Computer Vision and Image Understanding, vol.153, pp.64-76, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01349763

S. Bae and K. Yoon, Confidence-based data association and discriminative deep appearance learning for robust online multi-object tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.40, issue.3, pp.595-610, 2018.

J. Valin, F. Michaud, and J. Rouat, Robust localization and tracking of simultaneous moving sound sources using beamforming and particle filtering, Robotics and Autonomous Systems, vol.55, issue.3, pp.216-228, 2007.

A. Lombard, Y. Zheng, H. Buchner, and W. Kellermann, TDOA estimation for multiple sound sources in noisy and reverberant environments using broadband independent component analysis, IEEE Transactions on Audio, Speech, and Language Processing, vol.19, issue.6, pp.1490-1503, 2011.

X. Alameda-Pineda and R. Horaud, A geometric approach to sound source localization from time-delay estimates, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.22, pp.1082-1095, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00910081

Y. Dorfan and S. Gannot, Tree-based recursive expectation-maximization algorithm for localization of acoustic sources, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.23, issue.10, pp.1692-1703, 2015.

X. Li, L. Girin, R. Horaud, and S. Gannot, Estimation of the direct-path relative transfer function for supervised sound-source localization, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.24, issue.11, pp.2171-2186, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01349691

X. Li, L. Girin, R. Horaud, and S. Gannot, Multiple-speaker localization based on direct-path features and likelihood maximization with spatial sparsity regularization, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.25, issue.10, pp.1997-2012, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01413417

A. Deleforge, R. Horaud, Y. Y. Schechner, and L. Girin, Co-localization of audio sources in images using binaural features and locally-linear regression, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.23, issue.4, pp.718-731, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01112834

B. Gold, N. Morgan, and D. Ellis, Speech and audio signal processing: processing and perception of speech and music, 2011.

G. Lathoud, J. Odobez, and D. Gatica-Perez, AV16.3: An audiovisual corpus for speaker localization and tracking, Machine Learning for Multimodal Interaction, pp.182-195, 2004.

I. D. Gebru, S. Ba, X. Li, and R. Horaud, Audio-visual speaker diarization based on spatiotemporal Bayesian fusion, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.40, issue.5, pp.1086-1099, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01413403

B. Ristic, B. Vo, D. Clark, and B. Vo, A metric for performance evaluation of multi-target tracking algorithms, IEEE Transactions on Signal Processing, vol.59, issue.7, pp.3452-3457, 2011.

D. Vijayasenan and F. Valente, DiarTk: an open source toolkit for research in multistream speaker diarization and its application to meeting recordings, INTERSPEECH, pp.2170-2173, 2012.

N. Checka, K. Wilson, M. Siracusa, and T. Darrell, Multiple person and speaker activity tracking with a particle filter, IEEE International Conference on Acoustics, Speech, and Signal Processing, pp.881-884, 2004.

Y. Liu, W. Wang, J. Chambers, V. Kılıç, and A. Hilton, Particle flow SMC-PHD filter for audio-visual multi-speaker tracking, International Conference on Latent Variable Analysis and Signal Separation, pp.344-353, 2017.

Y. Liu, A. Hilton, J. Chambers, Y. Zhao, and W. Wang, Non-zero diffusion particle flow SMC-PHD filter for audio-visual multi-speaker tracking, IEEE International Conference on Acoustics, Speech, and Signal Processing, pp.4304-4308, 2018.

X. Qian, A. Brutti, M. Omologo, and A. Cavallaro, 3D audio-visual speaker tracking with an adaptive particle filter, IEEE International Conference on Acoustics, Speech and Signal Processing, pp.2896-2900, 2017.

I. D. Gebru, X. Alameda-Pineda, F. Forbes, and R. Horaud, EM algorithms for weighted-data clustering with application to audio-visual scene analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.38, issue.12, pp.2402-2415, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01261374

X. Li, Y. Ban, L. Girin, X. Alameda-Pineda, and R. Horaud, Online localization and tracking of multiple moving speakers in reverberant environments, IEEE Journal of Selected Topics in Signal Processing, vol.13, issue.1, pp.88-103, 2019.
URL : https://hal.archives-ouvertes.fr/hal-01851985

Y. Ban, X. Alameda-Pineda, C. Evers, and R. Horaud, Tracking multiple audio sources with the von Mises distribution and variational EM, IEEE Signal Processing Letters, vol.26, issue.6, pp.798-802, 2019.
URL : https://hal.archives-ouvertes.fr/hal-01969050

Y. Ban, L. Girin, X. Alameda-Pineda, and R. Horaud, Exploiting the complementarity of audio and visual data in multi-speaker tracking, IEEE ICCV Workshop on Computer Vision for Audio-Visual Media, pp.446-454, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01577965

Y. Ban, X. Li, X. Alameda-Pineda, L. Girin, and R. Horaud, Accounting for room acoustics in audio-visual multi-speaker tracking, IEEE International Conference on Acoustics, Speech and Signal Processing, pp.6553-6557, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01718114

A. Bhattacharyya, On a measure of divergence between two statistical populations defined by their probability distributions, Bull. Calcutta Math. Soc., vol.35, pp.99-109, 1943.

A. Deleforge, F. Forbes, and R. Horaud, High-dimensional regression with Gaussian mixtures and partially-latent response variables, Statistics and Computing, vol.25, issue.5, pp.893-911, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01107604

C. Bishop, Pattern Recognition and Machine Learning, 2006.

V. Smidl and A. Quinn, The Variational Bayes Method in Signal Processing, 2006.

X. Anguera Miro, S. Bozonnet, N. Evans, C. Fredouille, G. Friedland et al., Speaker diarization: A review of recent research, IEEE Transactions on Audio, Speech, and Language Processing, vol.20, issue.2, pp.356-370, 2012.

A. Noulas, G. Englebienne, and B. J. A. Kröse, Multimodal speaker diarization, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.34, issue.1, pp.79-93, 2012.

G. Lathoud and M. Magimai-Doss, A sector-based, frequency-domain approach to detection and localization of multiple speakers, IEEE International Conference on Acoustics, Speech, and Signal Processing, vol.3, pp.265-268, 2005.

Z. Cao, T. Simon, S. Wei, and Y. Sheikh, Realtime multi-person 2D pose estimation using part affinity fields, IEEE Conference on Computer Vision and Pattern Recognition, pp.7291-7299, 2017.

L. Zheng, H. Zhang, S. Sun, M. Chandraker, Y. Yang et al., Person re-identification in the wild, IEEE Conference on Computer Vision and Pattern Recognition, pp.1367-1376, 2017.

A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler, MOT16: A benchmark for multi-object tracking, 2016.

V. P. Minotto, C. R. Jung, and B. Lee, Multimodal multi-channel on-line speaker diarization using sensor fusion through SVM, IEEE Transactions on Multimedia, vol.17, issue.10, pp.1694-1705, 2015.