Towards Empowering Data Lakes with Knowledge Graphs


Helal, Ahmed (2021) Towards Empowering Data Lakes with Knowledge Graphs. Masters thesis, Concordia University.

Text (application/pdf)
Helal_MA_S2021.pdf - Accepted Version
Available under License Spectrum Terms of Access.


The emergence of data lakes has made it possible to store large amounts of data that arrive in different formats and at high speed. Data lakes are simultaneously a boon and a bane: while they are great data stores, exploring their content is tedious. Data lakes are schema-agnostic; in other words, they come with limited or no metadata, which makes data discovery time-consuming and cumbersome. In addition, some existing data lakes, such as open data portals, offer few functionalities for finding datasets, and these merely consist of basic search coupled with some filters. These limitations are costly because users spend considerable time looking for data rather than working on their main tasks. To mitigate this shortcoming, this thesis presents an approach that creates metadata on top of the content of data lakes to facilitate data discovery and data enrichment. The approach consists of two steps: first, constructing an RDF knowledge graph (KG) as a navigational structure that models the schema; second, providing the user with a set of APIs to discover and enrich data. To demonstrate the approach, this work presents a proof-of-concept (POC) system that captures the schema of tabular-like data and represents it as a KG (GLac) by means of LAC, an ontology for data lakes. The system then equips practitioners with user-friendly interface services to interact with GLac and compile a dataset for a given task. With these main contributions, the system offers promising results in terms of the quality of the generated schema.
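
As a rough illustration of the two steps described in the abstract, the sketch below profiles small CSV tables into schema-level triples and then answers a simple discovery query over them. It is not the thesis's implementation: the namespace, the predicate names (`type`, `isPartOf`, `name`, `distinctValues`), and the functions `profile_table` and `find_tables_with_column` are hypothetical and are not taken from the LAC ontology or the system's APIs. Plain tuples stand in for RDF triples so that the example needs only the Python standard library.

```python
import csv
import io

# Hypothetical namespace and predicate names, chosen for illustration only;
# they are NOT the terms defined by the LAC ontology.
LAC = "http://example.org/lac#"


def profile_table(table_name, csv_text):
    """Step 1 (sketch): return schema-level triples for one CSV table."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = list(reader)
    table_uri = f"{LAC}table/{table_name}"
    triples = [(table_uri, f"{LAC}type", f"{LAC}Table")]
    for idx, column in enumerate(header):
        col_uri = f"{table_uri}/column/{column}"
        values = [row[idx] for row in rows if idx < len(row)]
        triples.append((col_uri, f"{LAC}type", f"{LAC}Column"))
        triples.append((col_uri, f"{LAC}isPartOf", table_uri))
        triples.append((col_uri, f"{LAC}name", column))
        # A simple column statistic stored as metadata in the graph.
        triples.append((col_uri, f"{LAC}distinctValues", len(set(values))))
    return triples


def find_tables_with_column(graph, keyword):
    """Step 2 (sketch): tables having a column whose name contains the keyword."""
    matching_cols = {
        s for s, p, o in graph
        if p == f"{LAC}name" and keyword.lower() in str(o).lower()
    }
    return sorted(
        {o for s, p, o in graph if p == f"{LAC}isPartOf" and s in matching_cols}
    )


# Build one graph over two toy tables, then query it.
graph = []
graph += profile_table("patients", "patient_id,age\n1,34\n2,51\n")
graph += profile_table("visits", "visit_id,patient_id,date\n10,1,2021-01-05\n")

print(find_tables_with_column(graph, "patient_id"))
```

Because both toy tables share a `patient_id` column, the query returns both table identifiers, which hints at how schema metadata can point a user to candidate tables for enrichment. A real system would presumably store such triples in an RDF store and query them with SPARQL rather than scanning a Python list.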

The main findings of this thesis have been published in two venues: an extended abstract titled 'Data Lakes Empowered by Knowledge Graphs' and a paper titled 'A Demonstration of KGLac: A Data Discovery and Enrichment Platform for Data Science'. The former, accepted to the poster session of SIGMOD/PODS'21, describes how KGs can be used to make the content of data lakes easier to leverage. The latter, accepted to the demo session of VLDB'21, provides an overview of KGLac and illustrates the functionalities the platform supports on top of data lakes after processing their content.

Divisions: Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering
Item Type: Thesis (Masters)
Authors: Helal, Ahmed
Institution: Concordia University
Degree Name: M. Comp. Sc.
Program: Computer Science
Date: 10 August 2021
Thesis Supervisor(s): Mansour, Essam
ID Code: 988756
Deposited By: Ahmed Helal
Deposited On: 29 Nov 2021 16:49
Last Modified: 29 Nov 2021 16:49
All items in Spectrum are protected by copyright, with all rights reserved. The use of items is governed by Spectrum's terms of access.
