ABSTRACT
Measurement Framework for Assessing Quality of Big Data (MEGA) in Big Data Pipelines

Dave Bhardwaj
Concordia University, 2021
Big Data is widely used in decision-making, and businesses have seen how powerful data can be, especially in areas such as advertising and marketing. As institutions increasingly rely on their Big Data systems to make more informed and strategic business decisions, the quality of the underlying data becomes critically important. Our research addresses this need by studying the quality characteristics of Big Data, specifically the V’s of Big Data, and automating their measurement.

In this thesis, our aim is not only to present researchers with useful Big Data quality measurements, but also to bridge the gap between theoretical measurement models of Big Data quality characteristics and the application of these measurements to real-world Big Data systems. To that end, we propose the MEGA framework, which can be applied to Big Data pipelines to facilitate the extraction and interpretation of measurement indicators for the Big Data V’s. The framework allows these measurements to be applied at any phase of the pipeline architecture in order to flag quality anomalies in the underlying data before they can negatively impact the decision-making process. The theoretical quality measurement models for six of the Big Data V’s, namely Volume, Variety, Velocity, Veracity, Validity, and Vincularity, are currently automated.

The novelty of the MEGA approach lies in its ability to: i) process both structured and unstructured data, ii) track a variety of quality indicators defined for the V’s, iii) flag datasets that cross a defined quality threshold, and iv) define a general infrastructure for collecting, analyzing, and reporting the V’s measurement indicators to support trustworthy and meaningful decision-making.