Machine Learning for Fault Detection in Clouds

Abbasi Ghalehtaki, Razieh and Glitho, Roch (2023) Machine Learning for Fault Detection in Clouds. Masters thesis, Concordia University.

Full text not available from this repository.

Abstract

Cloud computing offers several important benefits, including an elastic architecture, easy accessibility, scalability, and flexibility. These benefits enable a wide variety of services and applications. These services, however, are prone to faults. A fault is a condition in which the system fails to perform its required functionality. Once a fault occurs, it may propagate through the cloud environment and cause service interruption, which can eventually make cloud services unavailable. Given the scale of cloud environments and the diversity of faults that may occur in them, automated fault detection with limited or no human intervention is crucial. Machine Learning (ML) is a key enabling technology that can provide an automated means of detecting faults and thereby make the network more reliable. However, the data gathered from cloud computing environments limits the applicability of ML-based solutions for several reasons. First, cloud data typically includes several features, some of which have a greater impact on the accuracy of the fault detection system than others. Second, some features may be missing or unknown because of the absence of traffic, maintenance, or errors in collecting information about the state of network devices (e.g., faults in monitoring equipment or applications). The values of such missing features may be necessary to build an accurate fault detection method, as missing informative feature values can affect the ultimate fault detection outcome. Third, even if the values of all features are known, determining how many features to use for fault detection is important. If too many features are selected, many of them may overlap, which can lead to over-fitting. If too few are selected, accuracy degrades. To address these challenges, a feature selection method is required that not only estimates the unknown feature values but also selects the most important features, with the aim of improving fault detection accuracy. In this thesis, we propose a context-aware feature selection method that uses sensitivity analysis to measure the impact of each feature on the output prediction. The method can also estimate missing feature values and their sensitivity, taking time-series, multivariate, and incomplete feature vectors as input. It comprises a Recurrent Neural Network (RNN) and a Denoising Auto-Encoder (DAE), stacked with an RNN Discriminative Model (RNN DM). The RNN and DAE handle time-series data and missing feature values, respectively, while the RNN DM predicts the system status. We evaluated the proposed framework on real-world Google cluster data. The simulation results indicate that the proposed feature selection method plays a significant role in providing accurate fault detection, as measured by F1-score.
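
The idea of using sensitivity analysis to rank features can be illustrated with a minimal sketch. This is not the method of the thesis, which relies on an RNN, a DAE, and an RNN DM. It assumes complete feature vectors (no missing values) and a hypothetical, hand-written scoring function `toy_fault_score` in place of a trained model. The weights, sample values, and feature names are invented for illustration. Sensitivity is estimated here by central finite differences and averaged over a few samples.

```python
import math


def toy_fault_score(features):
    """Hypothetical stand-in for a trained fault detector; returns a score in (0, 1)."""
    # Assumed weights, for illustration only; not taken from the thesis.
    weights = [2.0, 0.1, -1.0]
    z = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def sensitivities(model, features, eps=1e-4):
    """Central finite-difference estimate of |d model / d feature_i| at one input."""
    result = []
    for i in range(len(features)):
        up = list(features)
        down = list(features)
        up[i] += eps
        down[i] -= eps
        result.append(abs(model(up) - model(down)) / (2 * eps))
    return result


def rank_features(model, samples, names):
    """Average per-feature sensitivity over all samples, most influential first."""
    totals = [0.0] * len(names)
    for sample in samples:
        for i, s in enumerate(sensitivities(model, sample)):
            totals[i] += s
    means = [t / len(samples) for t in totals]
    return sorted(zip(names, means), key=lambda pair: pair[1], reverse=True)


# Hypothetical feature names and sample values.
names = ["cpu_usage", "disk_io", "memory_usage"]
samples = [[0.2, 0.5, 0.1], [0.0, 1.0, 0.3], [-0.4, 0.2, 0.0]]

ranking = rank_features(toy_fault_score, samples, names)
for name, score in ranking:
    print(f"{name}: {score:.4f}")
```

For this toy model the sensitivity of each feature is proportional to the magnitude of its assumed weight, so the ranking places `cpu_usage` first and `disk_io` last. A feature selection step could then keep only the top-ranked features, which is the trade-off the abstract describes: too many features risks over-fitting, too few degrades accuracy.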

Divisions: Concordia University > Gina Cody School of Engineering and Computer Science > Electrical and Computer Engineering
Item Type: Thesis (Masters)
Authors: Abbasi Ghalehtaki, Razieh and Glitho, Roch
Institution: Concordia University
Degree Name: M.A.
Program: Electrical and Computer Engineering
Date: 2023
Thesis Supervisor(s): Glitho, Roch
ID Code: 993044
Deposited By: Razieh Abbasi Ghalehtaki
Deposited On: 15 Nov 2023 15:20
Last Modified: 15 Nov 2023 15:20
All items in Spectrum are protected by copyright, with all rights reserved. The use of items is governed by Spectrum's terms of access.
