Understanding the Challenges and Providing Logging Support to Monitor Data Processing in Big Data Application

Wang, Zehao (2021) Understanding the Challenges and Providing Logging Support to Monitor Data Processing in Big Data Application. Masters thesis, Concordia University.

Text (application/pdf)
Wang_MCompSc_S2021.pdf - Accepted Version
Available under License Spectrum Terms of Access.
263kB

Abstract

To analyze large-scale data efficiently, developers have created various big data processing frameworks (e.g., Apache Spark). These frameworks provide abstractions so that developers can focus on implementing the logic for data analysis. In traditional software systems, developers leverage logging to monitor applications and record intermediate states to assist workload understanding and issue diagnosis. However, due to the abstraction and the peculiarity of big data frameworks, there is currently no effective monitoring approach for big data applications. In this thesis, we first manually study 1,000 randomly sampled Spark-related questions on Stack Overflow to identify their root causes and the type of information that, if recorded, can assist developers with monitoring and diagnosis. Then, we design an approach, DPLOG, which assists developers with monitoring Spark applications. DPLOG leverages statistical sampling to minimize performance overhead and provides intermediate information and hint/warning messages for each data processing step of a chained method pipeline. We evaluate DPLOG on six benchmarking programs and find that DPLOG has a relatively small overhead (i.e., less than a 10% increase in response time when processing 5GB of data) compared to running without DPLOG, and reduces the overhead by over 500% compared to the baseline. Our user study with 20 developers shows that DPLOG can reduce the time needed to debug big data applications by 63%, and the participants rate DPLOG's usefulness at 4.85/5 on average. Moreover, the idea of DPLOG may be applied to other big data processing frameworks, and our study sheds light on future research opportunities in assisting developers with monitoring big data applications.
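
The abstract describes the idea only at a high level, so the following is a minimal illustrative sketch and not DPLOG's actual implementation. It assumes a toy in-memory pipeline in plain Python (no Spark), uses simple Bernoulli sampling as a stand-in for the statistical sampling mentioned above, and every name in it (`SampledPipeline`, `sample_rate`, `_record`) is hypothetical. It shows how each step of a chained method pipeline could record record counts, a small sample of intermediate values, and a warning when a step produces no output.

```python
import random
from typing import Any, Callable, Iterable, List


class SampledPipeline:
    """Toy chained pipeline that logs a random sample of each step's output.

    Hypothetical illustration only; this is not the DPLOG API.
    """

    def __init__(self, data: Iterable[Any], sample_rate: float = 0.1, seed: int = 0):
        self.data: List[Any] = list(data)
        self.sample_rate = sample_rate      # fraction of records to inspect (assumed)
        self.rng = random.Random(seed)      # seeded so runs are repeatable
        self.log: List[str] = []            # collected monitoring messages

    def _record(self, step: str, before: int) -> None:
        """Log counts and a few sampled records; warn if the step emptied the data."""
        after = len(self.data)
        # Bernoulli sampling: keep each record with probability sample_rate.
        sample = [x for x in self.data if self.rng.random() < self.sample_rate]
        self.log.append(f"{step}: {before} -> {after} records, sample={sample[:3]}")
        if after == 0:
            self.log.append(f"WARNING {step}: output is empty")

    def map(self, fn: Callable[[Any], Any]) -> "SampledPipeline":
        before = len(self.data)
        self.data = [fn(x) for x in self.data]
        self._record("map", before)
        return self  # returning self enables method chaining

    def filter(self, pred: Callable[[Any], bool]) -> "SampledPipeline":
        before = len(self.data)
        self.data = [x for x in self.data if pred(x)]
        self._record("filter", before)
        return self


# Usage: the filter threshold is too high, so the last step yields nothing
# and the log ends with a warning pointing at the responsible step.
pipeline = (
    SampledPipeline(range(100), sample_rate=0.05)
    .map(lambda x: x * 2)
    .filter(lambda x: x > 500)
)
for line in pipeline.log:
    print(line)
```

Inspecting only a sample of each step's output keeps the extra work small relative to the step itself, which is the trade-off the abstract attributes to sampling; the per-step messages localize a problem to one link of the chain.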

Divisions: Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering
Item Type: Thesis (Masters)
Authors: Wang, Zehao
Institution: Concordia University
Degree Name: M. Comp. Sc.
Program: Computer Science
Date: April 2021
Thesis Supervisor(s): Chen, Tse-Hsun (Peter)
ID Code: 988371
Deposited By: Zehao Wang
Deposited On: 06 May 2022 18:54
Last Modified: 06 May 2022 18:54
All items in Spectrum are protected by copyright, with all rights reserved. The use of items is governed by Spectrum's terms of access.
