Soualhia, Mbarka (2018) Adaptive Failure-Aware Scheduling for Hadoop. PhD thesis, Concordia University.
Preview |
Text (application/pdf)
1MBSoualhia_PhD_S2018.pdf - Accepted Version |
Abstract
Given the dynamic nature of cloud environments, failures are the norm rather than the exception in data centers powering cloud frameworks. Despite the diversity of integrated recovery mechanisms in cloud frameworks, their schedulers still generate poor scheduling decisions leading to tasks' failures due to unforeseen events such as unpredicted demands of services or hardware outages. Traditionally, simulation and analytical modeling have been widely used to analyze the impact of the scheduling decisions on the failures rates. However, they cannot provide accurate results and exhaustive coverage of the cloud systems especially when failures occur. In this thesis, we present new approaches for modeling and verifying an adaptive failure-aware scheduling algorithm for Hadoop to early detect these failures and to reschedule tasks according to changes in the cloud. Hadoop is the framework of choice on many off-the-shelf clusters in the cloud to process data-intensive applications by efficiently running them across distributed multiple machines. The proposed scheduling algorithm for Hadoop relies on predictions made by machine learning algorithms trained on previously executed tasks and data collected from the Hadoop environment. To further improve Hadoop scheduling decisions on the fly, we use reinforcement learning techniques to select an appropriate scheduling action for a scheduled task. Furthermore, we propose an adaptive algorithm to dynamically detect failures of nodes in Hadoop. We implement the above approaches in ATLAS: an AdapTive Failure-Aware Scheduling algorithm that can be built on top of existing Hadoop schedulers. To illustrate the usefulness and benefits of ATLAS, we conduct a large empirical study on a Hadoop cluster deployed on Amazon Elastic MapReduce (EMR) to compare the performance of ATLAS to those of three Hadoop scheduling algorithms (FIFO, Fair, and Capacity). Results show that ATLAS outperforms these scheduling algorithms in terms of failures' rates, execution times, and resources utilization. Finally, we propose a new methodology to formally identify the impact of the scheduling decisions of Hadoop on the failures rates. We use model checking to verify some of the most important scheduling properties in Hadoop (schedulability, resources-deadlock freeness, and fairness) and provide possible strategies to avoid their occurrences in ATLAS. The formal verification of the Hadoop scheduler allows to identify more tasks failures and hence reduce the number of failures in ATLAS.
Divisions: | Concordia University > Gina Cody School of Engineering and Computer Science > Electrical and Computer Engineering |
---|---|
Item Type: | Thesis (PhD) |
Authors: | Soualhia, Mbarka |
Institution: | Concordia University |
Degree Name: | Ph. D. |
Program: | Electrical and Computer Engineering |
Date: | 6 February 2018 |
Thesis Supervisor(s): | Tahar, Sofiene and Khomh, Foutse |
ID Code: | 983565 |
Deposited By: | MBARKA SOUALHIA |
Deposited On: | 05 Jun 2018 14:41 |
Last Modified: | 05 Jun 2018 14:41 |
Repository Staff Only: item control page