Badiei, Bahareh (2023) On Parallelization of Categorical Data Clustering. Masters thesis, Concordia University.
Text (application/pdf)
Badiei_ MCompSc_S2023.pdf (2MB) - Accepted Version. Available under License Spectrum Terms of Access.
Abstract
We study the parallelization of categorical data clustering algorithms on an MPI platform. Clustering such data has been a daunting task even for sequential algorithms, mainly because of the difficulty of finding suitable similarity/distance measures. We propose a parallel version of the k-modes algorithm, called PV3, which maintains the same clustering quality as the sequential approach while achieving reasonable speed-ups. PV3 is programmed to ensure deterministic processing in a parallel environment. To produce better clustering results, we then develop an initialization method, called the Revised Density Method (RDM), based on the notion of density. Additionally, we develop variants of RDM to further enhance its performance. We then study effective ways to parallelize RDM and its variants. To further exploit parallelism opportunities, we develop an Ensemble Parallelizing Process (EPP) framework. This framework can be used with any desired initialization/clustering algorithms at different levels of parallelism. Using our different RDM initialization techniques along with the PV3 algorithm in the EPP framework, we then build an RDM realization of EPP, called RDM EPP. The results of our numerous experiments on benchmark categorical datasets indicate that the clustering quality of RDM EPP ranks among the top three k-modes-based sequential clustering algorithms. In terms of speed-up, the results show RDM EPP to be 7 times faster for some datasets, though much larger datasets are required for a more comprehensive scalability study of RDM EPP.
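For readers unfamiliar with k-modes, the following is a minimal sequential sketch of the textbook algorithm that the abstract builds on. It is not the thesis's PV3 or RDM. The simple matching (Hamming) dissimilarity, the tie-breaking rules, and all function names here are assumptions chosen for illustration. The tie-breaks are fixed so that the same input always gives the same output, in the spirit of the deterministic processing the abstract mentions.

```python
from collections import Counter

def hamming(a, b):
    """Simple matching dissimilarity: number of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))

def assign(records, modes):
    """Label each record with its nearest mode; ties go to the lowest index."""
    return [
        min(range(len(modes)), key=lambda j: (hamming(r, modes[j]), j))
        for r in records
    ]

def update_modes(records, labels, modes):
    """Recompute each mode as the per-attribute most frequent value."""
    new_modes = []
    for j, old in enumerate(modes):
        members = [r for r, lab in zip(records, labels) if lab == j]
        if not members:
            new_modes.append(old)  # keep the old mode for an empty cluster
            continue
        mode = []
        for column in zip(*members):
            counts = Counter(column)
            best = max(counts.values())
            # Assumed tie-break: smallest value among the most frequent ones.
            mode.append(min(v for v, c in counts.items() if c == best))
        new_modes.append(tuple(mode))
    return new_modes

def k_modes(records, init_modes, max_iter=100):
    """Alternate assignment and mode update until the labels stop changing."""
    modes = [tuple(m) for m in init_modes]
    labels = None
    for _ in range(max_iter):
        new_labels = assign(records, modes)
        if new_labels == labels:
            break
        labels = new_labels
        modes = update_modes(records, labels, modes)
    return labels, modes

records = [("a", "x"), ("a", "y"), ("b", "z"), ("b", "z")]
labels, modes = k_modes(records, init_modes=[("a", "x"), ("b", "z")])
```

In this sketch the assignment step handles each record independently of the others, which is the kind of work a data-parallel version could split across processes; how PV3 actually distributes it is described in the thesis itself.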
Divisions: | Concordia University > Gina Cody School of Engineering and Computer Science > General Studies |
---|---|
Item Type: | Thesis (Masters) |
Authors: | Badiei, Bahareh |
Institution: | Concordia University |
Degree Name: | M. Comp. Sc. |
Program: | Computer Science |
Date: | 8 May 2023 |
Thesis Supervisor(s): | Goswami, Dhrubajyoti and Shiri, Nematollaah |
ID Code: | 992340 |
Deposited By: | Bahareh Badiei |
Deposited On: | 14 Nov 2023 19:52 |
Last Modified: | 14 Nov 2023 19:52 |