Conference paper, Year: 2016

DSS: A Scalable and Efficient Stratified Sampling Algorithm for Large-Scale Datasets

Abstract

Statistical analysis of aggregated records is widely used in domains such as market research, sociological investigation, and network analysis. Stratified sampling (SS), which divides the population into distinct groups (strata) and samples each group separately, is preferred in practice for its effectiveness and accuracy. In this paper, we propose DSS, a scalable and efficient SS algorithm for large datasets. DSS executes all sampling operations in parallel by calculating the exact subsample size for each partition according to the data distribution. We implement DSS on Spark, a big-data processing system, and show through large-scale experiments that it achieves lower data-transmission cost and higher efficiency than state-of-the-art methods while maintaining high sample representativeness.
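
The abstract does not spell out how the per-partition subsample sizes are computed, so the following is only a minimal sketch of the general idea, not the DSS algorithm itself. It assumes a proportional split of each stratum's sample size across partitions using largest-remainder rounding, and it uses plain Python lists as stand-ins for Spark partitions. The names `allocate` and `stratified_sample` are hypothetical and do not come from the paper.

```python
import random
from collections import defaultdict


def allocate(counts, target):
    """Split `target` across partitions in proportion to `counts`.

    Uses largest-remainder rounding with integer arithmetic so the
    returned sizes sum exactly to min(target, sum(counts)).
    (Assumed allocation rule, chosen for illustration only.)
    """
    total = sum(counts)
    if total == 0:
        return [0] * len(counts)
    target = min(target, total)
    sizes = [target * c // total for c in counts]
    remainders = [target * c % total for c in counts]
    leftover = target - sum(sizes)
    # Give the remaining units to the partitions with the largest remainders.
    order = sorted(range(len(counts)), key=lambda i: remainders[i], reverse=True)
    for i in order[:leftover]:
        sizes[i] += 1
    return sizes


def stratified_sample(partitions, stratum_of, fraction, seed=0):
    """Sample `fraction` of every stratum, partition by partition."""
    rng = random.Random(seed)
    # Pass 1: group each partition's records by stratum
    # (stands in for a distributed per-partition count).
    grouped = []
    for part in partitions:
        groups = defaultdict(list)
        for record in part:
            groups[stratum_of(record)].append(record)
        grouped.append(groups)

    sample = []
    for stratum in sorted({s for groups in grouped for s in groups}):
        counts = [len(groups.get(stratum, [])) for groups in grouped]
        target = round(fraction * sum(counts))
        sizes = allocate(counts, target)
        # Pass 2: every partition draws its own exact quota independently,
        # so no records need to move between partitions before sampling.
        for groups, k in zip(grouped, sizes):
            if k:
                sample.extend(rng.sample(groups[stratum], k))
    return sample


if __name__ == "__main__":
    parts = [
        [("a", i) for i in range(60)] + [("b", i) for i in range(10)],
        [("a", i) for i in range(40)] + [("b", i) for i in range(30)],
    ]
    picked = stratified_sample(parts, lambda r: r[0], 0.1)
    print(len(picked), "records sampled")
```

Because each partition knows its exact quota before drawing, the second pass needs no coordination between partitions; only the small table of per-stratum counts has to be shared.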
Main file
432484_1_En_11_Chapter.pdf (1.03 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-01648006 , version 1 (24-11-2017)

License

Attribution

Cite

Minne Li, Dongsheng Li, Siqi Shen, Zhaoning Zhang, Xicheng Lu. DSS: A Scalable and Efficient Stratified Sampling Algorithm for Large-Scale Datasets. 13th IFIP International Conference on Network and Parallel Computing (NPC), Oct 2016, Xi'an, China. pp.133-146, ⟨10.1007/978-3-319-47099-3_11⟩. ⟨hal-01648006⟩