Skip to Main content Skip to Navigation

Mining and ranking closed itemsets from large-scale transactional datasets

Martin Kirchgessner 1 
1 SLIDE - ScaLable Information Discovery and Exploitation [Grenoble]
LIG - Laboratoire d'Informatique de Grenoble
Abstract : The recent increase of data volumes raises new challenges for itemset mining algorithms. In this thesis, we focus on transactional datasets (collections of items sets, for example supermarket tickets) containing at least a million transactions over hundreds of thousands items. These datasets usually follow a “long tail” distribution: a few items are very frequent, and most items appear rarely. Such distributions are often truncated by existing itemset mining algorithms, whose results concern only a very small portion of the available items (the most frequents, usually). Thus, existing methods fail to concisely provide relevant insights on large datasets. We therefore introduce a new semantics which is more intuitive for the analyst: browsing associations per item, for any item, and less than a hundred associations at once. To address the items’ coverage challenge, our first contribution is the item- centric mining problem. It consists in computing, for each item in the dataset, the k most frequent closed itemsets containing this item. We present an algorithm to solve it, TopPI. We show that TopPI computes efficiently interesting results over our datasets, outperforming simpler solutions or emulations based on existing algorithms, both in terms of run-time and result completeness. We also show and empirically validate how TopPI can be parallelized, on multi-core machines and on Hadoop clusters, in order to speed-up computation on large scale datasets. Our second contribution is CAPA, a framework allowing us to study which existing measures of association rules’ quality are relevant to rank results. This concerns results obtained from TopPI or from j LCM, our implementation of a state-of-the-art frequent closed itemsets mining algorithm (LCM). Our quantita- tive study shows that the 39 quality measures we compare can be grouped into 5 families, based on the similarity of the rankings they produce. We also involve marketing experts in a qualitative study, in order to discover which of the 5 families we propose highlights the most interesting associations for their domain. Our close collaboration with Intermarché, one of our industrial partners in the Datalyse project, allows us to show extensive experiments on real, nation-wide supermarket data. We present a complete analytics workflow addressing this use case. We also experiment on Web data. Our contributions can be relevant in various other fields, thanks to the genericity of transactional datasets. Altogether our contributions allow analysts to discover associations of interest in modern datasets. We pave the way for a more reactive discovery of items’ asso- ciations in large-scale datasets, whether on highly dynamic data or for interactive exploration systems.
Document type :
Complete list of metadata

Cited literature [73 references]  Display  Hide  Download
Contributor : Martin Kirchgessner Connect in order to contact the contributor
Submitted on : Monday, October 10, 2016 - 5:45:41 PM
Last modification on : Wednesday, July 6, 2022 - 4:19:30 AM
Long-term archiving on: : Saturday, February 4, 2017 - 1:30:24 AM


  • HAL Id : tel-01378794, version 1



Martin Kirchgessner. Mining and ranking closed itemsets from large-scale transactional datasets. Databases [cs.DB]. Université Grenoble Alpes, 2016. English. ⟨tel-01378794⟩



Record views


Files downloads