Automated Data Preparation using Graph Neural Networks

Monjazeb, Niki (2024) Automated Data Preparation using Graph Neural Networks. Masters thesis, Concordia University.

Text (application/pdf)
Monjazeb_MASc_F2024.pdf - Accepted Version
Available under License Spectrum Terms of Access.
2MB

Abstract

The process of data preparation is a time-consuming portion of data scientists' work. Automating this work will improve the quality of machine learning results and free data scientists to shift their focus to the machine learning task at hand. My research presents a system that automates this process by learning from the data preparation steps taken by others working on similar datasets. To automate data cleaning and transformation, datasets and their corresponding notebooks were extracted from Kaggle, and their information was abstracted before being uploaded into a knowledge graph. Graph Neural Network (GNN) models were trained on those knowledge graphs, and the most commonly used cleaning and transformation operations for similar datasets were inferred. These operations are offered to the user as recommendations that they can apply to their dataset using the corresponding APIs. These recommendations outperformed their state-of-the-art counterparts in terms of time, memory consumption, and accuracy. To detect similarity inclusion dependencies (sIND), knowledge graphs were created from datasets in the Prague Relational Learning Repository. From those knowledge graphs, the columns deemed to have an inclusion dependency were studied until features leading to this dependency were observed. These features were used to create a model that predicts the sIND between columns. The resulting model correctly predicted more sIND pairs, and in less time, than its competitor. This holistic platform can easily be integrated into any Data Science Pipeline (DSP) to facilitate the data preparation process for data scientists.
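
To make the notion of a similarity inclusion dependency concrete, the sketch below checks it by brute force between two columns. It is only an illustration of the concept, not the feature-based model the thesis describes: the function names, the two thresholds, and the use of Python's `difflib.SequenceMatcher` as the similarity measure are all assumptions introduced here.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (assumed measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def sind_coverage(dependent, referenced, value_threshold=0.8):
    """Fraction of distinct dependent values with a similar referenced value.

    value_threshold is a hypothetical cut-off for two values to count
    as "similar"; it does not come from the thesis.
    """
    dep_values = {str(v).strip() for v in dependent}
    ref_values = {str(v).strip() for v in referenced}
    if not dep_values:
        return 0.0
    matched = sum(
        1
        for d in dep_values
        if any(similarity(d, r) >= value_threshold for r in ref_values)
    )
    return matched / len(dep_values)


def holds_sind(dependent, referenced,
               value_threshold=0.8, coverage_threshold=0.9):
    """True if enough dependent values are approximately contained
    in the referenced column (coverage_threshold is also hypothetical)."""
    coverage = sind_coverage(dependent, referenced, value_threshold)
    return coverage >= coverage_threshold
```

With these assumed thresholds, a column containing a misspelling such as "Vancover" would still count as included in a column containing "Vancouver", which an exact inclusion check would reject. The pairwise comparison is quadratic in the number of distinct values, which is one reason a learned predictor over column features can be faster.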

Divisions: Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering
Item Type: Thesis (Masters)
Authors: Monjazeb, Niki
Institution: Concordia University
Degree Name: M. Comp. Sc.
Program: Computer Science
Date: 16 August 2024
Thesis Supervisor(s): Mansour, Essam
ID Code: 994463
Deposited By: NIKI MONJAZEB
Deposited On: 24 Oct 2024 16:22
Last Modified: 24 Oct 2024 16:22
All items in Spectrum are protected by copyright, with all rights reserved. The use of items is governed by Spectrum's terms of access.
