Joining Distributed Database Summaries

Abstract: The database summarization system named SaintEtiQ provides multi-level summaries of tabular data stored in a centralized database. Summaries are computed online with a conceptual hierarchical clustering algorithm. However, in many companies, data are distributed among several sites, either homogeneously (i.e., sites contain data for a common set of features) or heterogeneously (i.e., sites contain data for different features). Consequently, the current centralized version of SaintEtiQ is either infeasible or undesirable because of privacy or resource constraints. In this paper, we propose two new algorithms for summarizing heterogeneously distributed data without a prior "unification" of the data sources: the Subspace-Oriented Join Algorithm (SOJA) and the Tree Alignment-based Join Algorithm (TAJA). The main idea of both algorithms is to apply innovative joins to two local models, computed over two disjoint sets of features, to provide a global summary over the full feature set without scanning the raw data. SOJA takes one of the two input trees as the base model and processes the other to complete it, whereas TAJA rearranges summaries level by level in a top-down manner. We then propose a consistent quality measure to quantify how good the joined hierarchies are. Finally, an experimental study on synthetic data sets shows that both joining processes (SOJA and TAJA) produce high-quality clustering schemas of the entire distributed data and are very efficient in computational time compared with the centralized approach.
Document type:
Report
[Research Report] RR-6768, INRIA. 2008, pp.29

https://hal.inria.fr/inria-00346528
Contributor: Guillaume Raschia
Submitted on: Thursday, 11 December 2008 - 17:01:57
Last modified on: Wednesday, 11 April 2018 - 01:56:31
Document(s) archived on: Tuesday, 8 June 2010 - 16:36:44

File

RR-6768.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: inria-00346528, version 1

Citation

Mounir Bechchi, Guillaume Raschia, Noureddine Mouaddib. Joining Distributed Database Summaries. [Research Report] RR-6768, INRIA. 2008, pp.29. 〈inria-00346528〉
