Index-based Join Operations in Hive

Mofidpoor, Mahsa (2013) Index-based Join Operations in Hive. Masters thesis, Concordia University.

Mofidpoor_MSc_S2013.pdf - Accepted Version (application/pdf, 2MB)
Available under License Spectrum Terms of Access.

Abstract

The exponential growth of data being generated, manipulated, analyzed, and archived introduces new challenges and opportunities for dealing with so-called big data. Hive is a batch-oriented big data system, well suited for query processing and data analysis. Originally developed by Facebook in 2009 and now under the Apache Software Foundation, Hive is gaining popularity for its SQL-like query language, HiveQL, and for supporting the majority of the SQL operations found in relational database management systems (RDBMS). Because join is an expensive operation in RDBMS, it has been the focus of many query optimization techniques aimed at improving the performance of database systems. We investigate such techniques for join operations in Hive and develop an index-based join algorithm for queries in HiveQL. When a query requires only a small subset of the data, selected by a predicate in the WHERE clause, the brute-force method of scanning entire tables performs poorly: it incurs redundant disk I/O and, when the query is executed as a MapReduce job, launches map tasks over irrelevant data.
In this work, we implement the proposed index-based technique and integrate it into Hive. To add our extension, we obtain Hive architecture details by reverse engineering the code and map our design to the conceptual optimization flow. To evaluate performance, after setting up the environment, we run relevant test queries on datasets generated using the industry-standard benchmark TPC-H. Our results indicate a significant performance gain for relatively large datasets or highly selective queries.
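
The contrast drawn in the abstract, between scanning whole tables and using an index to reach only the rows a WHERE predicate selects, can be illustrated with a small in-memory sketch. This is not the thesis's algorithm or Hive's implementation: the tables, column names, and the helper functions `build_index`, `scan_join`, and `index_join` are all hypothetical, and a plain Python dictionary stands in for an index. It only shows why a highly selective query reads far fewer rows when an index is available.

```python
from collections import defaultdict

# Hypothetical toy tables, loosely shaped like TPC-H customers/orders.
customers = [
    {"custkey": 1, "nation": "CANADA"},
    {"custkey": 2, "nation": "FRANCE"},
    {"custkey": 3, "nation": "CANADA"},
]
orders = [
    {"orderkey": 10, "custkey": 1, "total": 100.0},
    {"orderkey": 11, "custkey": 2, "total": 250.0},
    {"orderkey": 12, "custkey": 3, "total": 75.0},
    {"orderkey": 13, "custkey": 1, "total": 40.0},
]


def build_index(rows, column):
    """Map each value of `column` to the positions of the rows holding it."""
    index = defaultdict(list)
    for pos, row in enumerate(rows):
        index[row[column]].append(pos)
    return index


def scan_join(left, right, key, column, value):
    """Brute force: read every row of both tables, filter, then join."""
    rows_read = 0
    selected = []
    for l in left:
        rows_read += 1
        if l[column] == value:
            selected.append(l)
    joined = []
    for r in right:
        rows_read += 1
        for l in selected:
            if l[key] == r[key]:
                joined.append({**l, **r})
    return joined, rows_read


def index_join(left, right, key, filter_index, value, join_index):
    """Index-based: fetch only rows the indexes point to."""
    rows_read = 0
    joined = []
    for lpos in filter_index.get(value, []):
        l = left[lpos]
        rows_read += 1
        for rpos in join_index.get(l[key], []):
            r = right[rpos]
            rows_read += 1
            joined.append({**l, **r})
    return joined, rows_read


# Roughly: SELECT * FROM customers c JOIN orders o
#          ON c.custkey = o.custkey WHERE c.nation = 'FRANCE'
nation_index = build_index(customers, "nation")
order_cust_index = build_index(orders, "custkey")

scan_rows, scan_reads = scan_join(customers, orders, "custkey", "nation", "FRANCE")
idx_rows, idx_reads = index_join(
    customers, orders, "custkey", nation_index, "FRANCE", order_cust_index
)
print(scan_reads, idx_reads)
```

Both functions return the same joined row here, but the scan touches every row of both tables while the index-based version touches only the matching ones. The sketch ignores the cost of building and storing the indexes, which any real system has to weigh against the savings.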

Divisions: Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering
Item Type: Thesis (Masters)
Authors: Mofidpoor, Mahsa
Institution: Concordia University
Degree Name: M. Sc.
Program: Computer Science
Date: April 2013
ID Code: 977192
Deposited By: MAHSA MOFID POOR
Deposited On: 13 Jun 2013 20:32
Last Modified: 18 Jan 2018 17:44
All items in Spectrum are protected by copyright, with all rights reserved. The use of items is governed by Spectrum's terms of access.
