Conference papers

Reuse-based Optimization for Pig Latin

Jesús Camacho-Rodríguez (1, 2), Dario Colazzo (3), Melanie Herschel (2, 1), Ioana Manolescu (1, 2), Soudip Roy Chowdhury (1, 2)
1 OAK - Database optimizations and architectures for complex large data
LRI - Laboratoire de Recherche en Informatique, UP11 - Université Paris-Sud - Paris 11, Inria Saclay - Ile de France, CNRS - Centre National de la Recherche Scientifique : UMR8623
Abstract : Pig Latin has become a popular language within the data management community interested in the efficient parallel processing of large data volumes. The dataflow-style primitives of Pig Latin provide an intuitive way for users to write complex analytical queries, which are in turn compiled into MapReduce jobs. Currently, subexpressions occurring repeatedly in Pig Latin scripts are executed as many times as they occur, leading to avoidable MapReduce jobs. The current Pig Latin optimizer is not capable of recognizing, and thus optimizing, such repeated subexpressions. We present a novel approach for identifying and reusing common subexpressions occurring in Pig Latin scripts. In particular, we lay the foundation of our reuse-based algorithms by formalizing the semantics of the Pig Latin query language with extended nested relational algebra for bags. Our algorithm, named PigReuse, operates on the algebraic representations of Pig Latin scripts, identifies subexpression merging opportunities, selects the best ones to execute based on a cost function, and merges other equivalent expressions so that they share their results. Our experimental results demonstrate the efficiency and effectiveness of our reuse-based algorithms and optimization strategies.
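To make the idea of repeated subexpressions concrete, the following is a minimal Python sketch, not the PigReuse algorithm itself. It assumes a hypothetical plan encoding as nested tuples `(operator, parameter, *children)` and only detects syntactically identical sub-plans; the algebraic equivalence reasoning and the cost-based selection described in the abstract are not modeled. All names (`subexpressions`, `shared_subexpressions`, the sample file and predicates) are illustrative assumptions.

```python
from collections import Counter

# Hypothetical plan encoding (an assumption for illustration):
# a plan is a tuple (operator, parameter, *child_plans).

def subexpressions(plan):
    """Yield the plan itself and every sub-plan below it."""
    yield plan
    for child in plan[2:]:
        yield from subexpressions(child)

def shared_subexpressions(plans):
    """Return the sub-plans that occur more than once across all plans."""
    counts = Counter()
    for plan in plans:
        counts.update(subexpressions(plan))
    return {expr: n for expr, n in counts.items() if n > 1}

# Two toy queries that both filter the same loaded input.
load = ("LOAD", "visits.csv")
filtered = ("FILTER", "year == 2014", load)
q1 = ("GROUP", "user", filtered)
q2 = ("DISTINCT", None, filtered)

shared = shared_subexpressions([q1, q2])
for expr, n in shared.items():
    print(n, expr[0])
```

Here the `FILTER` sub-plan and the `LOAD` beneath it each appear in both queries, so a reuse-aware optimizer could evaluate them once and feed the result to both consumers.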

Contributor : Soudip Roy Chowdhury
Submitted on : Monday, November 24, 2014 - 2:39:31 PM
Last modification on : Wednesday, November 17, 2021 - 12:32:15 PM
Long-term archiving on : Friday, April 14, 2017 - 8:34:55 PM


Files produced by the author(s)


  • HAL Id : hal-01086497, version 1


Jesús Camacho-Rodríguez, Dario Colazzo, Melanie Herschel, Ioana Manolescu, Soudip Roy Chowdhury. Reuse-based Optimization for Pig Latin. BDA'2014: 30e journées Bases de Données Avancées, Oct 2014, Grenoble-Autrans, France. ⟨hal-01086497⟩


