Yao, Jinli (2025) Towards Better Clustering: From Quality Criteria to Advanced Hierarchical Algorithms. PhD thesis, Concordia University.
Text (application/pdf): Yao_PhD_F2025.pdf (23MB) - Accepted Version. Available under License Spectrum Terms of Access.
Abstract
Clustering is a cornerstone of unsupervised learning, offering powerful tools for uncovering patterns and natural groupings in unlabeled data. Despite its extensive applications across diverse fields, clustering research faces persistent challenges, including inconsistent definitions, varied evaluation criteria, and difficulties in handling complex data characteristics. This thesis addresses these challenges by integrating theoretical insights with algorithmic innovations to enhance clustering methodologies and their applicability.
The first part of this work explores the fundamental question, “What defines a good cluster?” Through a systematic review of clustering criteria, principles, and evaluation metrics, it highlights the diversity of clustering algorithms and the challenges posed by high-dimensional, overlapping, and varied-density data. This foundational analysis establishes a structured understanding of clustering quality and its implications for algorithm design.
Building on these principles, the thesis introduces Gauging-δ, a nonparametric hierarchical clustering algorithm capable of handling diverse cluster shapes. Employing an adaptive mergeability function, the algorithm iteratively merges clusters based on local data statistics and environmental factors. Rigorous experiments on synthetic and real-world datasets demonstrate its robustness in identifying well-separated clusters and its sensitivity to feature and distance metric selection.
The thesis further presents Gauging-β, a density-aware hierarchical clustering algorithm addressing challenges in data separation. The proposed algorithm leverages density-based methods to identify and remove border points, effectively separating the datasets. Gauging-δ is then applied to the remaining points to generate the main clusters. Finally, the border points are reintegrated into the formed clusters. Experimental results demonstrate that the algorithm can handle convex and non-convex datasets, whether well separated or poorly separated. The impact of parameter settings on clustering outcomes is thoroughly investigated. Further experiments on real-world datasets reveal that the consistency of clustering results with classification labels strongly depends on an appropriate measure of sample similarity.
Together, these three components offer a coherent approach to clustering, from clarifying theoretical concepts of cluster quality to developing algorithms capable of identifying meaningful clusters in various synthetic and complex real-world datasets.
| Divisions: | Concordia University > Gina Cody School of Engineering and Computer Science > Concordia Institute for Information Systems Engineering |
|---|---|
| Item Type: | Thesis (PhD) |
| Authors: | Yao, Jinli |
| Institution: | Concordia University |
| Degree Name: | Ph. D. |
| Program: | Information and Systems Engineering |
| Date: | 7 July 2025 |
| Thesis Supervisor(s): | Zeng, Yong |
| ID Code: | 996054 |
| Deposited By: | Jinli Yao |
| Deposited On: | 04 Nov 2025 16:47 |
| Last Modified: | 04 Nov 2025 16:47 |