DSS: A Scalable and Efficient Stratified Sampling Algorithm for Large-Scale Datasets

Abstract: Statistical analysis of aggregated records is widely used in domains such as market research, sociological investigation, and network analysis. Stratified sampling (SS), which divides the population into distinct groups and samples each group separately, is preferred in practice for its effectiveness and accuracy. In this paper, we propose DSS, a scalable and efficient SS algorithm for processing large datasets. DSS executes all sampling operations in parallel by calculating the exact subsample size for each partition according to the data distribution. We implement DSS on Spark, a big-data processing system, and show through large-scale experiments that it achieves lower data-transmission cost and higher efficiency than state-of-the-art methods while maintaining high sample representativeness.
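The idea of fixing an exact subsample size per partition before sampling can be illustrated with a small sketch. This is not the DSS algorithm from the paper, whose details are not given in the abstract. It is a plain-Python illustration, without Spark, that assumes per-stratum target sizes are known up front and that splits each target across partitions in proportion to the local stratum counts using largest-remainder rounding. The names `allocate`, `stratified_sample`, `stratum_of`, and `targets` are hypothetical.

```python
import random
from collections import Counter


def allocate(counts, target):
    """Split `target` across partitions in proportion to `counts`.

    Uses largest-remainder rounding (an assumption of this sketch), so the
    quotas sum to exactly `target` and never exceed a partition's count.
    """
    total = sum(counts)
    if target > total:
        raise ValueError("target exceeds the number of available records")
    if total == 0:
        return [0] * len(counts)
    # Integer arithmetic avoids floating-point rounding surprises.
    quotas = [target * c // total for c in counts]
    remainders = [target * c % total for c in counts]
    leftover = target - sum(quotas)
    # Hand the leftover units to the partitions with the largest remainders.
    order = sorted(range(len(counts)), key=lambda i: -remainders[i])
    for i in order[:leftover]:
        quotas[i] += 1
    return quotas


def stratified_sample(partitions, stratum_of, targets, seed=0):
    """Draw exactly targets[s] records from each stratum s.

    Pass 1 counts strata per partition; pass 2 samples every partition
    independently, so no records need to move between partitions.
    """
    rng = random.Random(seed)
    counts = [Counter(stratum_of(r) for r in part) for part in partitions]
    quotas = {s: allocate([c[s] for c in counts], n) for s, n in targets.items()}
    sample = []
    for i, part in enumerate(partitions):
        by_stratum = {}
        for record in part:
            by_stratum.setdefault(stratum_of(record), []).append(record)
        for s, rows in by_stratum.items():
            if s in quotas:
                sample.extend(rng.sample(rows, quotas[s][i]))
    return sample
```

Because each partition knows its quota before sampling starts, the second pass could run on all partitions at once; only the small per-stratum counts would have to be exchanged, not the records themselves.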
Document type: Conference paper
Guang R. Gao; Depei Qian; Xinbo Gao; Barbara Chapman; Wenguang Chen. 13th IFIP International Conference on Network and Parallel Computing (NPC), Oct 2016, Xi'an, China. Springer International Publishing, Lecture Notes in Computer Science, LNCS-9966, pp.133-146, 2016, Network and Parallel Computing. 〈10.1007/978-3-319-47099-3_11〉

Cited literature: 21 references

https://hal.inria.fr/hal-01648006
Contributor: Hal Ifip
Submitted on: Friday, 24 November 2017 - 16:49:17
Last modified on: Friday, 24 November 2017 - 16:50:58

File

Restricted access
File visible on: 2019-01-01

License

Distributed under a Creative Commons Attribution 4.0 International License

Citation

Minne Li, Dongsheng Li, Siqi Shen, Zhaoning Zhang, Xicheng Lu. DSS: A Scalable and Efficient Stratified Sampling Algorithm for Large-Scale Datasets. Guang R. Gao; Depei Qian; Xinbo Gao; Barbara Chapman; Wenguang Chen. 13th IFIP International Conference on Network and Parallel Computing (NPC), Oct 2016, Xi'an, China. Springer International Publishing, Lecture Notes in Computer Science, LNCS-9966, pp.133-146, 2016, Network and Parallel Computing. 〈10.1007/978-3-319-47099-3_11〉. 〈hal-01648006〉
