On Parallelization of Categorical Data Clustering

Badiei, Bahareh (2023) On Parallelization of Categorical Data Clustering. Masters thesis, Concordia University.

Text (application/pdf)
Badiei_ MCompSc_S2023.pdf - Accepted Version
Available under License Spectrum Terms of Access.
2MB

Abstract

We study the parallelization of categorical data clustering algorithms on an MPI platform. Clustering such data has been a daunting task even for sequential algorithms, mainly because of the difficulty of finding suitable similarity/distance measures. We propose a parallel version of the k-modes algorithm, called PV3, which maintains the same clustering quality as the sequential approach while achieving reasonable speed-ups. PV3 is programmed to ensure deterministic processing in a parallel environment. To produce better clustering results, we then develop an initialization method called the Revised Density Method (RDM), based on the notion of density. Additionally, we develop variants of RDM to further enhance its performance. We then study effective ways to parallelize RDM and its variants. To further exploit parallelism opportunities, we develop an Ensemble Parallelizing Process (EPP) framework. This framework can be used with any desired initialization/clustering algorithms and with different levels of parallelism. Using our different RDM initialization techniques along with the PV3 algorithm in the EPP framework, we then build an RDM realization of EPP, called RDM EPP. Results from numerous experiments on benchmark categorical datasets indicate that the clustering quality of RDM EPP ranks among the top three when compared with sequential k-modes-based clustering algorithms. In terms of speed-up, the results indicate that RDM EPP is 7 times faster for some datasets, though much larger datasets are required for a more comprehensive scalability study of RDM EPP.
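As a rough illustration of the idea that a data-parallel k-modes can reproduce the sequential result deterministically, here is a minimal Python sketch. It is not the thesis's PV3 implementation. The function names, the round-robin partitioning, the tie-breaking rules, and the in-process simulation of a reduction (instead of real MPI calls) are all assumptions made for this example.

```python
from collections import Counter

def hamming(a, b):
    """Simple matching dissimilarity: number of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))

def assign(records, modes):
    """Assign each record to its nearest mode; ties go to the lowest cluster index."""
    return [min(range(len(modes)), key=lambda c: (hamming(r, modes[c]), c))
            for r in records]

def local_counts(records, labels, k, n_attrs):
    """Per-cluster, per-attribute category counts for one partition of the data."""
    counts = [[Counter() for _ in range(n_attrs)] for _ in range(k)]
    for r, c in zip(records, labels):
        for j, v in enumerate(r):
            counts[c][j][v] += 1
    return counts

def merge_counts(parts, k, n_attrs):
    """Sum partial counts (stands in for a reduction across processes)."""
    total = [[Counter() for _ in range(n_attrs)] for _ in range(k)]
    for part in parts:
        for c in range(k):
            for j in range(n_attrs):
                total[c][j].update(part[c][j])
    return total

def modes_from_counts(counts, old_modes):
    """Most frequent category per attribute; ties broken by sorted category value."""
    new_modes = []
    for c, per_attr in enumerate(counts):
        mode = []
        for j, ctr in enumerate(per_attr):
            if ctr:
                mode.append(min(ctr, key=lambda v: (-ctr[v], v)))
            else:
                mode.append(old_modes[c][j])  # empty cluster keeps its old mode
        new_modes.append(tuple(mode))
    return new_modes

def k_modes(records, init_modes, n_parts=1, max_iter=100):
    """k-modes where each iteration is computed over n_parts simulated partitions."""
    k, n_attrs = len(init_modes), len(init_modes[0])
    modes = [tuple(m) for m in init_modes]
    for _ in range(max_iter):
        chunks = [records[i::n_parts] for i in range(n_parts)]  # hypothetical split
        parts = [local_counts(ch, assign(ch, modes), k, n_attrs) for ch in chunks]
        new_modes = modes_from_counts(merge_counts(parts, k, n_attrs), modes)
        if new_modes == modes:
            break
        modes = new_modes
    return modes, assign(records, modes)

if __name__ == "__main__":
    data = [("red", "small", "round"), ("red", "small", "square"),
            ("blue", "large", "round"), ("blue", "large", "square"),
            ("red", "large", "round"), ("blue", "small", "square")]
    init = [data[0], data[2]]
    # The partitioned run should match the single-partition run exactly.
    print(k_modes(data, init, n_parts=1) == k_modes(data, init, n_parts=3))
```

Because category counts are additive and every tie is broken by a fixed rule, the merged counts do not depend on how the records are split, which is one plausible way to keep a parallel run consistent with the sequential one. In an actual MPI program the merge step would typically be a collective reduction over count arrays.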

Divisions: Concordia University > Gina Cody School of Engineering and Computer Science > General Studies
Item Type: Thesis (Masters)
Authors: Badiei, Bahareh
Institution: Concordia University
Degree Name: M. Comp. Sc.
Program: Computer Science
Date: 8 May 2023
Thesis Supervisor(s): Goswami, Dhrubajyoti and Shiri, Nematollaah
ID Code: 992340
Deposited By: Bahareh Badiei
Deposited On: 14 Nov 2023 19:52
Last Modified: 14 Nov 2023 19:52
All items in Spectrum are protected by copyright, with all rights reserved. The use of items is governed by Spectrum's terms of access.
