Towards High-Quality Clustering and Analysis Algorithms for Categorical Data

BRNAWY, RAHMAH (2025) Towards High-Quality Clustering and Analysis Algorithms for Categorical Data. PhD thesis, Concordia University.

Text (application/pdf)
Brnawy_PhD_F2025.pdf - Accepted Version
Available under License Spectrum Terms of Access.
5MB

Abstract

Categorical data are prevalent across many disciplines, making clustering a valuable tool for analysis. However, clustering categorical data is particularly challenging because of its non-numerical nature and multi-modal distributions. While numerous techniques have been developed to address these challenges, several issues persist. Since most of the proposed techniques adapt algorithms like k-means, which were originally designed for numerical data, they often fail to fully exploit and capture the unique characteristics of categorical datasets. A main problem is that those techniques require the number of clusters to be set in advance, which restricts their application and may result in bias when no expert knowledge is available. Additionally, since they tend to choose cluster seeds randomly, they perform well on binary-class data but struggle when applied to multi-class or imbalanced binary datasets. In this thesis, we propose three techniques specifically designed to address these challenges.
First, we introduce a cohesion-based clustering process that determines the potential number of clusters dynamically and also allows detection of small clusters without relying on k-means or k-modes-like methods. Unlike conventional clustering algorithms that assign weights to all the attributes, we adopt a mechanism that assigns weights to clusters at the attribute-value level, improving cluster cohesion and interpretability. Second, we develop multi-criteria subspace-based clustering techniques that, unlike the traditional subspace solution, search the entire space. Our techniques leverage two existing clustering strategies, namely density-based and theoretical clustering, to identify small, non-redundant clusters. To ensure effective merging, we extend the hierarchical clustering technique so that discovered clusters can be combined while preventing small clusters from being absorbed into larger ones. Third, we study methods for measuring quality and diversity in ensemble clustering selection. Existing methods often rely on external validation metrics, which are not well-suited to the characteristics of categorical data. Those metrics tend to introduce redundancy and favour large clusters, potentially overlooking smaller yet meaningful ones. Furthermore, most current approaches assess quality and diversity only at the cluster level, neglecting more granular and informative perspectives. To overcome these limitations, we adopt and extend a measure based on granular computing, which aligns more naturally with the discrete and multi-level structure of categorical data. The resulting measure helps evaluate quality and diversity at the class level as well as the object level, enabling a more nuanced and effective ensemble selection process. To evaluate the proposed techniques, we conduct extensive experiments using real-world and synthetic benchmark datasets. The experimental results and their analyses demonstrate overall enhanced clustering performance and show that the techniques help identify small clusters in certain datasets.
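
As background for the limitations named in the abstract, the following is a minimal sketch of the standard k-modes baseline, not of any technique proposed in the thesis. It shows the two points the abstract criticizes: the number of clusters `k` must be supplied by the caller, and the initial modes are drawn at random. The function names, the toy records, and the `seed` and `max_iter` parameters are illustrative assumptions.

```python
import random
from collections import Counter


def matching_dissimilarity(a, b):
    """Count the attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(records):
    """Return the most frequent value of each attribute across the records."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))


def k_modes(records, k, seed=0, max_iter=20):
    """Baseline k-modes: k is fixed up front and seeds are chosen at random."""
    rng = random.Random(seed)
    modes = rng.sample(records, k)  # random seeding (assumed simplest variant)
    labels = None
    for _ in range(max_iter):
        # Assign each record to the nearest mode under simple matching.
        new_labels = [
            min(range(k), key=lambda j: matching_dissimilarity(r, modes[j]))
            for r in records
        ]
        if new_labels == labels:
            break  # assignments are stable, so the modes will not change
        labels = new_labels
        # Recompute each mode from its current members.
        for j in range(k):
            members = [r for r, lab in zip(records, labels) if lab == j]
            if members:  # keep the previous mode if a cluster is empty
                modes[j] = column_modes(members)
    return labels, modes


# Hypothetical toy records: (colour, shape, size)
data = [
    ("red", "round", "small"),
    ("red", "round", "large"),
    ("red", "square", "small"),
    ("blue", "square", "large"),
    ("blue", "square", "small"),
    ("blue", "round", "large"),
]
labels, modes = k_modes(data, k=2, seed=0)
```

Running `k_modes` with different `seed` values on the same records can produce different partitions, which is one way to observe the seed sensitivity described above.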

Divisions: Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering
Item Type: Thesis (PhD)
Authors: BRNAWY, RAHMAH
Institution: Concordia University
Degree Name: Ph. D.
Program: Computer Science
Date: 21 May 2025
Thesis Supervisor(s): Shiri, Nematollaah and Ahmad, Amir
ID Code: 995858
Deposited By: RAHMAH BRNAWY
Deposited On: 04 Nov 2025 15:42
Last Modified: 04 Nov 2025 15:42
All items in Spectrum are protected by copyright, with all rights reserved. The use of items is governed by Spectrum's terms of access.
