Helal, Ahmed (2021) Towards Empowering Data Lakes with Knowledge Graphs. Masters thesis, Concordia University.
Text (application/pdf): Helal_MA_S2021.pdf (3MB), Accepted Version. Available under License Spectrum Terms of Access.
Abstract
The emergence of data lakes has made it possible to store large amounts of data that arrive in different formats and at high speed. Data lakes are simultaneously a boon and a bane: while they are great data stores, it is tedious to explore their content. In fact, data lakes are schema-agnostic. In other words, they come with limited or no metadata, which makes data discovery time-consuming and cumbersome. In addition, some existing data lakes, such as open data portals, offer users few functionalities for finding datasets, and these merely consist of basic search coupled with some filters. These limitations are costly because users spend considerable time looking for data rather than working on their main tasks. To mitigate this shortcoming, this thesis presents an approach that creates metadata on top of the content of data lakes to facilitate data discovery and data enrichment. This approach consists of two steps: first, constructing an RDF knowledge graph (KG) as a navigational structure to model the schema; second, providing the user with a set of APIs to discover and enrich data. To demonstrate this approach, this work presents a proof-of-concept (POC) system that captures the schema of tabular-like data and represents it as a KG (GLac) by means of LAC, an ontology for data lakes. It then equips practitioners with user-friendly interface services to interact with GLac and compile a dataset for a given task. With these main contributions, the system offers promising results in terms of the quality of the generated schema.
The main findings of this thesis have been published in two venues: an extended abstract titled 'Data Lakes Empowered by Knowledge Graphs' and a demonstration paper titled 'A Demonstration of KGLac: A Data Discovery and Enrichment Platform for Data Science'. The former, accepted to the poster session of SIGMOD/PODS'21, describes how KGs can be used to make the content of data lakes easier to leverage. The latter, accepted to the demo session of VLDB'21, provides an overview of KGLac and illustrates the functionalities the platform supports on top of data lakes after processing their content.
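
The two-step approach described in the abstract (capture the schema of tabular data as a graph, then expose discovery calls over it) can be illustrated with a minimal sketch. This is not the KGLac implementation or its API: the namespace, the predicate names (`hasColumn`, `hasDataType`, `name`), and the functions `build_schema_graph` and `find_tables_with_column` are all hypothetical, and the graph is a plain set of (subject, predicate, object) triples rather than a real RDF store using the LAC ontology.

```python
import csv
import io

# Hypothetical namespace and predicate names, chosen for illustration only;
# they are not the actual terms of the LAC ontology.
NS = "http://example.org/lake/"
HAS_COLUMN = NS + "hasColumn"
HAS_TYPE = NS + "hasDataType"
HAS_NAME = NS + "name"


def infer_type(values):
    """Guess a coarse column type from its non-empty string values."""
    values = [v for v in values if v != ""]
    if not values:
        return "unknown"
    try:
        for v in values:
            float(v)
        return "numeric"
    except ValueError:
        return "text"


def build_schema_graph(tables):
    """Step 1: turn {table_name: csv_text} into a set of (s, p, o) triples."""
    triples = set()
    for table_name, csv_text in tables.items():
        rows = list(csv.reader(io.StringIO(csv_text)))
        if not rows:
            continue
        header, body = rows[0], rows[1:]
        table_iri = NS + table_name
        for i, column in enumerate(header):
            column_iri = f"{table_iri}/{column}"
            values = [row[i] for row in body if i < len(row)]
            triples.add((table_iri, HAS_COLUMN, column_iri))
            triples.add((column_iri, HAS_NAME, column))
            triples.add((column_iri, HAS_TYPE, infer_type(values)))
    return triples


def find_tables_with_column(graph, keyword):
    """Step 2: a discovery call that returns tables having a matching column."""
    keyword = keyword.lower()
    matching_columns = {
        s for s, p, o in graph if p == HAS_NAME and keyword in o.lower()
    }
    return sorted(
        s for s, p, o in graph if p == HAS_COLUMN and o in matching_columns
    )


# A toy "data lake" of two small tables (made-up data).
lake = {
    "patients": "id,age,city\n1,34,Montreal\n2,51,Laval\n",
    "clinics": "clinic_id,city\n10,Montreal\n",
}
graph = build_schema_graph(lake)
print(find_tables_with_column(graph, "city"))
```

In this sketch, the shared `city` column is what makes both toy tables discoverable from a single keyword, which is the kind of navigation a schema-level graph is meant to support. A real system would presumably replace the string matching with richer column profiling and an actual RDF store.
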
Divisions: | Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering |
---|---|
Item Type: | Thesis (Masters) |
Authors: | Helal, Ahmed |
Institution: | Concordia University |
Degree Name: | M. Comp. Sc. |
Program: | Computer Science |
Date: | 10 August 2021 |
Thesis Supervisor(s): | Mansour, Essam |
ID Code: | 988756 |
Deposited By: | Ahmed Helal |
Deposited On: | 29 Nov 2021 16:49 |
Last Modified: | 29 Nov 2021 16:49 |