Monjazeb, Niki (2024) Automated Data Preparation using Graph Neural Networks. Masters thesis, Concordia University.
Text (application/pdf)
Monjazeb_MASc_F2024.pdf (2MB) - Accepted Version. Available under License Spectrum Terms of Access.
Abstract
Data preparation is a time-consuming portion of data scientists' work. Automating this work will improve the quality of the machine learning results and free data scientists to shift their focus to the machine learning task at hand. My research presents a system that automates this process by learning from the data preparation steps taken by others working on similar datasets. To automate data cleaning and transformation, datasets and their corresponding notebooks were extracted from Kaggle, and their information was abstracted before being uploaded into a knowledge graph. Graph Neural Network (GNN) models were trained on those knowledge graphs, and the most commonly used cleaning and transformation operations for similar datasets were inferred. These operations are offered to the user as recommendations that they can apply to their dataset using the corresponding APIs. These recommendations have outperformed their state-of-the-art counterparts in terms of time, memory consumption, and accuracy. To detect similarity inclusion dependencies (sIND), knowledge graphs were created from datasets in the Prague Relational Learning Repository. From those knowledge graphs, the columns deemed to have an inclusion dependency were studied until features leading to this dependency were observed. These features were used to create a model that could predict the sIND between columns. The resulting model correctly predicted more sIND pairs, in less time, than its competitor. This holistic platform can easily be integrated into any Data Science Pipeline (DSP) and facilitate the data preparation process for data scientists.
Divisions: | Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering |
---|---|
Item Type: | Thesis (Masters) |
Authors: | Monjazeb, Niki |
Institution: | Concordia University |
Degree Name: | M. Comp. Sc. |
Program: | Computer Science |
Date: | 16 August 2024 |
Thesis Supervisor(s): | Mansour, Essam |
ID Code: | 994463 |
Deposited By: | NIKI MONJAZEB |
Deposited On: | 24 Oct 2024 16:22 |
Last Modified: | 24 Oct 2024 16:22 |