S. Parekh, S. Essid, A. Ozerov, N. Q. Duong, P. Pérez et al., Weakly supervised representation learning for unsynchronized audio-visual events, CoRR, 2018.

S. Parekh, S. Essid, A. Ozerov, N. Q. Duong, P. Pérez et al., Weakly supervised representation learning for unsynchronized audio-visual events, Workshop on Sight and Sound, CVPR, 2018.

A. Mesaros, T. Heittola, O. Dikmen, and T. Virtanen, Sound event detection in real life recordings using coupled matrix factorization of spectral representations and class activity annotations, ICASSP, pp.151-155, 2015.

X. Zhuang, X. Zhou, M. A. Hasegawa-Johnson, and T. S. Huang, Real-world acoustic event detection, Pattern Recognition Letters, vol.31, issue.12, pp.1543-1551, 2010.

S. Adavanne, P. Pertilä, and T. Virtanen, Sound event detection using spatial features and convolutional recurrent neural network, ICASSP, pp.771-775, 2017.

V. Bisot, R. Serizel, S. Essid, and G. Richard, Feature learning with matrix factorization applied to acoustic scene classification, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.25, issue.6, pp.1216-1229, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01362864

J. Xu, A. G. Schwing, and R. Urtasun, Learning to segment under various forms of weak supervision, CVPR, pp.3781-3790, 2015.
DOI : 10.1109/cvpr.2015.7299002

M. Gao, Z. Xu, L. Lu, A. Wu, I. Nogues et al., Segmentation label propagation using deep convolutional neural networks and dense conditional random field, ISBI, pp.1265-1268, 2016.
DOI : 10.1109/isbi.2016.7493497

T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez, Solving the multiple instance problem with axis-parallel rectangles, Artificial Intelligence, vol.89, issue.1-2, pp.31-71, 1997.

Y. Tian, J. Shi, B. Li, Z. Duan, and C. Xu, Audio-visual event localization in unconstrained videos, ECCV, 2018.
DOI : 10.1007/978-3-030-01216-8_16
URL : http://arxiv.org/pdf/1803.08842

D. D. Lee and H. S. Seung, Algorithms for nonnegative matrix factorization, Advances in Neural Information Processing Systems, pp.556-562, 2001.

A. Ozerov and C. Févotte, Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation, IEEE Transactions on Audio, Speech, and Language Processing, vol.18, pp.550-563, 2010.
DOI : 10.1109/tasl.2009.2031510

V. Bisot, S. Essid, and G. Richard, Overlapping sound event detection with supervised nonnegative matrix factorization, ICASSP, pp.31-35, 2017.
DOI : 10.1109/icassp.2017.7951792

T. Heittola, A. Mesaros, T. Virtanen, and A. Eronen, Sound event detection in multisource environments using source separation, Machine Listening in Multisource Environments, 2011.

R. Gao, R. Feris, and K. Grauman, Learning to separate object sounds by watching unlabeled video, ECCV, 2018.
DOI : 10.1007/978-3-030-01219-9_3
URL : http://arxiv.org/pdf/1804.01665

H. Zhao, C. Gan, A. Rouditchenko, C. Vondrick, J. McDermott et al., The sound of pixels, ECCV, 2018.
DOI : 10.1007/978-3-030-01246-5_35

A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson et al., Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation, ACM Trans. Graph, vol.37, issue.4, pp.1-112, 2018.
DOI : 10.1145/3197517.3201357
URL : http://dl.acm.org/ft_gateway.cfm?id=3201357&type=pdf

C. L. Zitnick and P. Dollár, Edge boxes: Locating object proposals from edges, ECCV, pp.391-405, 2014.
DOI : 10.1007/978-3-319-10602-1_26
URL : http://research.microsoft.com/en-us/um/people/larryz/ZitnickDollarECCV14edgeBoxes.pdf

J. Deng, W. Dong, R. Socher, L. Li, K. Li et al., ImageNet: A large-scale hierarchical image database, Computer Vision and Pattern Recognition, pp.248-255, 2009.

C. Févotte, N. Bertin, and J. Durrieu, Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis, Neural Computation, vol.21, issue.3, pp.793-830, 2009.

S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen et al., CNN architectures for large-scale audio classification, ICASSP, pp.131-135, 2017.
DOI : 10.1109/icassp.2017.7952132
URL : http://arxiv.org/pdf/1609.09430

S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan et al., YouTube-8M: A large-scale video classification benchmark, 2016.

H. Bilen and A. Vedaldi, Weakly supervised deep detection networks, CVPR, pp.2846-2854, 2016.
DOI : 10.1109/cvpr.2016.311

W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier et al., The kinetics human action video dataset, 2017.

C. Févotte, E. Vincent, and A. Ozerov, Single-channel audio source separation with NMF: divergences, constraints and algorithms, Audio Source Separation, pp.1-24, 2018.

M. Spiertz and V. Gnann, Source-filter based clustering for monaural blind source separation, Proceedings of the International Conference on Digital Audio Effects (DAFx-09), 2009.

NMF Mel Clustering Code

R. Girshick, Fast R-CNN, ICCV, pp.1440-1448, 2015.
DOI : 10.1109/iccv.2015.169

V. Code,

D. Stowell, D. Giannoulis, E. Benetos, M. Lagrange, and M. D. Plumbley, Detection and classification of acoustic scenes and events, IEEE Transactions on Multimedia, vol.17, issue.10, pp.1733-1746, 2015.
DOI : 10.1109/tmm.2015.2428998
URL : https://hal.archives-ouvertes.fr/hal-01123760

E. Vincent, R. Gribonval, and C. Févotte, Performance measurement in blind audio source separation, IEEE Transactions on Audio, Speech, and Language Processing, vol.14, pp.1462-1469, 2006.
DOI : 10.1109/tsa.2005.858005
URL : https://hal.archives-ouvertes.fr/inria-00544230