Mofidpoor, Mahsa (2013) Index-based Join Operations in Hive. Masters thesis, Concordia University.
Text (application/pdf), 2MB: Mofidpoor_MSc_S2013.pdf - Accepted Version. Available under License Spectrum Terms of Access.
Abstract
INDEX-BASED JOIN OPERATIONS IN HIVE
MAHSA MOFIDPOOR
The exponential growth of data being generated, manipulated, analyzed, and archived introduces new challenges and opportunities for dealing with so-called big data. Hive is a batch-oriented big data system, well suited for query processing and data analysis. Originally developed by Facebook in 2009 and now maintained under the Apache Software Foundation, Hive is gaining popularity for its SQL-like query language, HiveQL, and for supporting the majority of the SQL operations found in relational database management systems (RDBMS). As an expensive operation in RDBMS, the join has been the focus of many query optimization techniques aimed at improving the performance of database systems. We investigate such techniques for join operations in Hive and develop an index-based join algorithm for queries in HiveQL. When a query requires only a small subset of the data, selected by a predicate in the WHERE clause, the brute-force method of scanning entire tables performs poorly: it incurs redundant disk I/O and, when the query is executed as a MapReduce job, launches unnecessary map tasks.
In this work, we implement the proposed index-based technique and integrate it into Hive. To add our extension, we obtain the details of Hive's architecture by reverse engineering the code and map our design onto the conceptual optimization flow. To evaluate performance, we set up a test environment and run relevant test queries on datasets generated with the industry-standard benchmark TPC-H. Our results indicate significant performance gains for relatively large datasets or highly selective queries.
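The contrast the abstract draws between scanning entire tables and using an index can be illustrated with a small in-memory sketch. This is not the thesis's algorithm or Hive code: the function names, the dictionary-based index, the restriction to equality predicates, and the toy `orders`/`lineitem` rows are all assumptions made for illustration. The sketch only counts how many rows each strategy has to read for a selective query.

```python
from collections import defaultdict


def build_index(table, column):
    """Hypothetical index: map each column value to the row positions holding it."""
    index = defaultdict(list)
    for pos, row in enumerate(table):
        index[row[column]].append(pos)
    return index


def full_scan_join(left, right, key, filter_col, filter_val):
    """Brute force: read every row of both tables, filter, then hash-join."""
    rows_read = 0
    selected = []
    for row in left:
        rows_read += 1
        if row[filter_col] == filter_val:
            selected.append(row)
    by_key = defaultdict(list)
    for row in right:
        rows_read += 1
        by_key[row[key]].append(row)
    joined = [{**l, **r} for l in selected for r in by_key.get(l[key], [])]
    return joined, rows_read


def index_join(left, right, key, filter_val, left_filter_idx, right_key_idx):
    """Index-based: read only the rows that the two indexes point to."""
    rows_read = 0
    joined = []
    for lpos in left_filter_idx.get(filter_val, []):
        l = left[lpos]
        rows_read += 1
        for rpos in right_key_idx.get(l[key], []):
            r = right[rpos]
            rows_read += 1
            joined.append({**l, **r})
    return joined, rows_read


# Toy data (assumed, loosely TPC-H flavored): 1000 orders, 2 line items each.
orders = [{"orderkey": i, "status": "F" if i % 100 == 0 else "O"} for i in range(1000)]
lineitem = [{"orderkey": i // 2, "qty": i} for i in range(2000)]

status_idx = build_index(orders, "status")        # index on the filter column
orderkey_idx = build_index(lineitem, "orderkey")  # index on the join key

scan_result, scan_reads = full_scan_join(orders, lineitem, "orderkey", "status", "F")
idx_result, idx_reads = index_join(orders, lineitem, "orderkey", "F", status_idx, orderkey_idx)

print(len(scan_result), scan_reads)
print(len(idx_result), idx_reads)
```

Both functions return the same joined rows; the difference is in `rows_read`, which stands in for the disk I/O and map tasks the abstract refers to. The cost of building and storing the indexes is deliberately left out of the count.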
Divisions: | Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering |
---|---|
Item Type: | Thesis (Masters) |
Authors: | Mofidpoor, Mahsa |
Institution: | Concordia University |
Degree Name: | M. Sc. |
Program: | Computer Science |
Date: | April 2013 |
ID Code: | 977192 |
Deposited By: | MAHSA MOFID POOR |
Deposited On: | 13 Jun 2013 20:32 |
Last Modified: | 18 Jan 2018 17:44 |