Finding Groups of Duplicate Images In Very Large Dataset

Winn Voravuthikunchai; Bruno Crémilleux; Frédéric Jurie

Conference paper, Year : 2012


Winn Voravuthikunchai

Function : Author

Equipe Image - Laboratoire GREYC - UMR6072

Bruno Crémilleux

Function : Author
PersonId : 15791
IdHAL : bruno-cremilleux
ORCID : 0000-0001-8294-9049
IdRef : 083548335

Equipe CODAG - Laboratoire GREYC - UMR6072

Frédéric Jurie

Function : Author
PersonId : 3233
IdHAL : frederic-jurie
ORCID : 0000-0002-2686-0020
IdRef : 080485022

Equipe Image - Laboratoire GREYC - UMR6072

Abstract

This paper addresses the problem of detecting groups of duplicates in large-scale unstructured image datasets such as the Internet. Leveraging recent progress in data mining, we propose an efficient approach based on searching for closed patterns. Moreover, we present a novel way to encode the bag-of-words image representation into data mining transactions. We validate our approach on a new dataset of one million Internet images obtained with random searches on Google image search. Using the proposed method, we find more than 80 thousand groups of duplicates among the one million images in less than three minutes while using only 150 megabytes of memory. Unlike other existing approaches, our method can scale gracefully to larger datasets, as it has linear time and space (memory) complexity. Furthermore, the approach does not need to build or use any precomputed indexing structure.
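To make the idea of closed patterns over image transactions concrete, here is a minimal sketch. It is not the authors' algorithm or encoding, which the abstract does not detail. It assumes each image has already been reduced to a set of visual-word IDs (one transaction per image), and it uses a brute-force pairwise search that is quadratic in the number of images, unlike the linear method the paper reports. The function name `closed_pattern_groups`, the parameter `min_items`, and the toy data are all hypothetical.

```python
from itertools import combinations


def closed_pattern_groups(transactions, min_items=3):
    """Map each closed itemset shared by >= 2 images to the images containing it.

    transactions: dict of image id -> frozenset of visual-word ids (assumed encoding).
    min_items: smallest shared itemset considered evidence of duplication (assumed).
    """
    groups = {}
    for a, b in combinations(sorted(transactions), 2):
        shared = transactions[a] & transactions[b]
        if len(shared) < min_items:
            continue
        # Extent: every image whose transaction contains the shared itemset.
        extent = frozenset(i for i, t in transactions.items() if shared <= t)
        # Closure: the items common to all images in the extent. The closure
        # of an itemset is closed: no strict superset has the same support.
        closure = frozenset.intersection(*(transactions[i] for i in extent))
        groups[closure] = extent
    return groups


# Toy transactions (made-up visual-word ids, for illustration only).
transactions = {
    "img_a": frozenset({3, 17, 42, 99}),
    "img_b": frozenset({3, 17, 42, 99}),
    "img_c": frozenset({3, 17, 42, 7}),
    "img_d": frozenset({5, 8, 13, 21}),
}
groups = closed_pattern_groups(transactions, min_items=3)
for pattern, images in groups.items():
    print(sorted(pattern), sorted(images))
```

In this toy data, `img_a` and `img_b` share all four items and form one candidate group, while the three-item pattern additionally covers `img_c`; `img_d` shares nothing and is left out. A realistic system would replace the pairwise loop with a dedicated closed-pattern miner.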

Domains

Image Processing [eess.IV], Machine Learning [cs.LG]

Main file

12_bmvc_LargeDataetsMining.pdf (4.5 MB)

Origin : Files produced by the author(s)

Contributor : Yvain Queau

https://hal.science/hal-00806196

Submitted on : Friday, March 29, 2013, 3:22:47 PM

Last modification on : Wednesday, March 20, 2024, 4:20:04 PM

Long-term archiving on : Sunday, April 2, 2017, 10:43:12 PM

Dates and versions

hal-00806196 , version 1 (29-03-2013)

Identifiers

HAL Id : hal-00806196 , version 1

Cite

Winn Voravuthikunchai, Bruno Crémilleux, Frédéric Jurie. Finding Groups of Duplicate Images In Very Large Dataset. Proceedings of the British Machine Vision Conference (BMVC 2012), Sep 2012, Guildford, United Kingdom. pp.105.1--105.12. ⟨hal-00806196⟩


Collections

CNRS GREYC GREYC-CODAG GREYC-IMAGE COMUE-NORMANDIE ENSICAEN UNICAEN
