BRNAWY, RAHMAH (2025) Towards High-Quality Clustering and Analysis Algorithms for Categorical Data. PhD thesis, Concordia University.
Text (application/pdf), 5MB: Brnawy_PhD_F2025.pdf - Accepted Version. Available under License Spectrum Terms of Access.
Abstract
Categorical data are prevalent across various disciplines, making clustering a valuable tool for analysis. However, clustering categorical data is particularly challenging due to their non-numerical nature and multi-modal distributions. While numerous techniques have been developed to address these challenges, several issues persist. Since most of the proposed techniques adapt algorithms like k-modes, which were originally designed for categorical data, they often fail to fully exploit and capture the unique characteristics of categorical datasets. A main problem is that those techniques require setting the number of clusters, which restricts their application and may result in bias when no expert knowledge is available. Additionally, since they tend to choose cluster seeds randomly, they perform well on binary-class data but struggle when applied to multi-class or imbalanced binary datasets. In this thesis, we propose three techniques specifically designed to address these challenges.
First, we introduce a cohesion-based clustering process that determines the potential number of clusters dynamically and also allows detection of small clusters without relying on k-means-like or k-modes-like methods. Unlike conventional clustering algorithms, which assign weights to all the attributes, we adopt a mechanism that assigns weights to clusters at the attribute-value level, improving cluster cohesion and interpretability.
Second, we develop multi-criteria subspace-based clustering techniques that, unlike the traditional subspace solution, search the entire space. Our techniques leverage two existing clustering strategies, namely density-based and theoretical clustering, to identify small, non-redundant clusters. To ensure effective merging, we extend the hierarchical clustering technique to allow discovered clusters to be combined while preventing small clusters from being absorbed into larger ones.
Third, we study methods for measuring quality and diversity in ensemble clustering selection. Existing methods often rely on external validation metrics, which are not well-suited to the characteristics of categorical data. Those metrics tend to introduce redundancy and favour large clusters, potentially overlooking smaller yet meaningful ones. Furthermore, most current approaches assess quality and diversity only at the cluster level, neglecting more granular and informative perspectives. To overcome these limitations, we adopt and extend a measure based on granular computing, which aligns more naturally with the discrete and multi-level structure of categorical data. The resulting measure helps evaluate quality and diversity at the class level as well as the object level, enabling a more nuanced and effective ensemble selection process.
To evaluate the performance of the proposed techniques, we conduct extensive experiments using real-world and synthetic benchmark datasets. The experimental results and their analyses demonstrate overall enhanced clustering performance and the identification of small clusters in certain datasets.
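For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of a k-modes-style procedure, not the techniques proposed in the thesis. It is included only to show where the two limitations named above come from: the number of clusters `k` must be supplied by the caller, and the initial modes are drawn at random. The function names, the toy data, and the use of simple matching dissimilarity are assumptions made for illustration.

```python
import random
from collections import Counter

def hamming(a, b):
    """Simple matching dissimilarity: count of attributes on which a and b differ."""
    return sum(x != y for x, y in zip(a, b))

def column_modes(rows):
    """Most frequent value of each attribute across the given rows."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))

def k_modes(data, k, seed=0, max_iter=20):
    """Illustrative k-modes-style loop (hypothetical helper, not from the thesis)."""
    rng = random.Random(seed)
    # Limitation 1: k is fixed by the caller.
    # Limitation 2: the initial modes (seeds) are chosen at random.
    modes = rng.sample(data, k)
    labels = None
    for _ in range(max_iter):
        # Assign each object to its nearest mode.
        new_labels = [
            min(range(k), key=lambda j: hamming(row, modes[j])) for row in data
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Recompute each mode from its current members.
        for j in range(k):
            members = [row for row, lab in zip(data, labels) if lab == j]
            if members:
                modes[j] = column_modes(members)
    return labels, modes

# Toy categorical dataset (made up for this sketch).
data = [
    ("red", "small", "yes"),
    ("red", "small", "no"),
    ("red", "medium", "yes"),
    ("blue", "large", "no"),
    ("blue", "large", "yes"),
    ("blue", "medium", "no"),
]
labels, modes = k_modes(data, k=2, seed=0)
```

Running the sketch with different `seed` values can yield different partitions of the same data, which is the sensitivity to random seeding that the abstract describes.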
| Divisions: | Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering |
|---|---|
| Item Type: | Thesis (PhD) |
| Authors: | BRNAWY, RAHMAH |
| Institution: | Concordia University |
| Degree Name: | Ph. D. |
| Program: | Computer Science |
| Date: | 21 May 2025 |
| Thesis Supervisor(s): | Shiri, Nematollaah and Ahmad, Amir |
| ID Code: | 995858 |
| Deposited By: | RAHMAH BRNAWY |
| Deposited On: | 04 Nov 2025 15:42 |
| Last Modified: | 04 Nov 2025 15:42 |