Reuse-based Optimization for Pig Latin

Jesús Camacho-Rodríguez 1, 2 Dario Colazzo 3 Melanie Herschel 2, 1 Ioana Manolescu 1, 2 Soudip Roy Chowdhury 1, 2
1 OAK - Database optimizations and architectures for complex large data
CNRS - Centre National de la Recherche Scientifique : UMR8623, Inria Saclay - Ile de France, UP11 - Université Paris-Sud - Paris 11, LRI - Laboratoire de Recherche en Informatique
Abstract : Pig Latin has become a popular language within the data management community interested in the efficient parallel processing of large data volumes. The dataflow-style primitives of Pig Latin provide an intuitive way for users to write complex analytical queries, which are in turn compiled into MapReduce jobs. Currently, subexpressions occurring repeatedly in Pig Latin scripts are executed as many times as they occur, leading to avoidable MapReduce jobs. The current Pig Latin optimizer is not capable of recognizing, and thus optimizing, such repeated subexpressions. We present a novel approach for identifying and reusing common subexpressions occurring in Pig Latin scripts. In particular, we lay the foundation of our reuse-based algorithms by formalizing the semantics of the Pig Latin query language with extended nested relational algebra for bags. Our algorithm, named PigReuse, operates on the algebraic representations of Pig Latin scripts, identifies subexpression merging opportunities, selects the best ones to execute based on a cost function, and merges the remaining equivalent expressions so that they share their results. Our experimental results demonstrate the efficiency and effectiveness of our reuse-based algorithms and optimization strategies.
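The idea of sharing repeated subexpressions can be illustrated with a minimal sketch. It is not the PigReuse algorithm itself: it assumes a hypothetical `Op` class standing in for an algebraic operator, detects only subexpressions that are structurally identical (the paper also handles equivalent but differently written expressions), and leaves out the cost-based selection step entirely. The operator names and parameters below are made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Op:
    """Hypothetical algebraic operator: a name, its parameters, its inputs."""
    name: str
    params: Tuple[str, ...] = ()
    inputs: Tuple["Op", ...] = ()


def share_common_subexpressions(roots: List[Op]) -> Tuple[List[Op], int]:
    """Rebuild the plans so that structurally equal subexpressions become
    one shared node. Returns the rewritten roots and the number of distinct
    operators that remain."""
    canonical: Dict[Op, Op] = {}

    def visit(op: Op) -> Op:
        # Rewrite the inputs first, then look up the rebuilt operator.
        rebuilt = Op(op.name, op.params, tuple(visit(i) for i in op.inputs))
        # Frozen dataclasses hash and compare by value, so an equal
        # subexpression seen earlier is returned instead of the new copy.
        return canonical.setdefault(rebuilt, rebuilt)

    new_roots = [visit(r) for r in roots]
    return new_roots, len(canonical)


# Two illustrative plans that repeat the same LOAD and FILTER steps.
def filtered_views() -> Op:
    load = Op("LOAD", ("page_views",))
    return Op("FILTER", ("time > 10",), (load,))


plan_a = Op("GROUP", ("user",), (filtered_views(),))
plan_b = Op("DISTINCT", (), (filtered_views(),))

shared_roots, distinct_ops = share_common_subexpressions([plan_a, plan_b])
```

After the rewrite, both plans point to the same `FILTER` node, so an engine that evaluates each node once would run the repeated `LOAD` and `FILTER` steps a single time. Choosing which shared plan to execute when several alternatives exist is where a cost function, as described in the abstract, would come in.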

https://hal.inria.fr/hal-01086497
Contributor : Soudip Roy Chowdhury
Submitted on : Monday, November 24, 2014 - 2:39:31 PM
Last modification on : Monday, May 28, 2018 - 2:38:02 PM
Long-term archiving on : Friday, April 14, 2017 - 8:34:55 PM

File

PigReuse-CR-BDA.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-01086497, version 1

Citation

Jesús Camacho-Rodríguez, Dario Colazzo, Melanie Herschel, Ioana Manolescu, Soudip Roy Chowdhury. Reuse-based Optimization for Pig Latin. BDA'2014: 30e journées Bases de Données Avancées, Oct 2014, Grenoble-Autrans, France. ⟨hal-01086497⟩
