Chen, Yewang, Tang, Shenyu, Bouguila, Nizar, Wang, Cheng, Du, Jixiang and Li, HaiLin
(2018)
*A Fast Clustering Algorithm based on pruning unnecessary distance computations in DBSCAN for High-Dimensional Data.*
Pattern Recognition
.
ISSN 00313203
(In Press)

Text (In press, accepted manuscript) (application/pdf)
1MBbouglia nizar 2018.pdf - Accepted Version Restricted to Registered users only until 5 June 2020. Available under License Spectrum Terms of Access. |

Official URL: http://dx.doi.org/10.1016/j.patcog.2018.05.030

## Abstract

Clustering is an important technique to deal with large scale data which are explosively created in internet. Most data are high-dimensional with a lot of noise, which brings great challenges to retrieval, classification and understanding. No current existing approach is “optimal” for large scale data. For example, DBSCAN requires O(n2) time, Fast-DBSCAN only works well in 2 dimensions, and ρ-Approximate DBSCAN runs in O(n) expected time which needs dimension D to be a relative small constant for the linear running time to hold. However, we prove theoretically and experimentally that ρ-Approximate DBSCAN degenerates to an O(n2) algorithm in very high dimension such that 2D > > n. In this paper, we propose a novel local neighborhood searching technique, and apply it to improve DBSCAN, named as NQ-DBSCAN, such that a large number of unnecessary distance computations can be effectively reduced. Theoretical analysis and experimental results show that NQ-DBSCAN averagely runs in O(n*log(n)) with the help of indexing technique, and the best case is O(n) if proper parameters are used, which makes it suitable for many realtime data.

Divisions: | Concordia University > Gina Cody School of Engineering and Computer Science > Concordia Institute for Information Systems Engineering |
---|---|

Item Type: | Article |

Refereed: | Yes |

Authors: | Chen, Yewang and Tang, Shenyu and Bouguila, Nizar and Wang, Cheng and Du, Jixiang and Li, HaiLin |

Journal or Publication: | Pattern Recognition |

Date: | 5 June 2018 |

Digital Object Identifier (DOI): | 10.1016/j.patcog.2018.05.030 |

Keywords: | DBSCAN ρ-Approximate DBSCANNQ-DBSCAN |

ID Code: | 983946 |

Deposited By: | MONIQUE LANE |

Deposited On: | 14 Jun 2018 19:19 |

Last Modified: | 14 Jun 2018 19:27 |

## References:

[1] J. Song, L. Gao, F. Nie, H.T. Shen, Y. Yan, N. Sebe Optimized graph learning using partial tags and multiple features for image and video annotation IEEE Transactions on Image Processing, 25 (11) (2016), pp. 4999-5011[2] J. Song, L. Gao, F. Zou, Y. Yan, N. Sebe Deep and fast: Deep learning hashing with semi-supervised graph construction Image and Vision Computing (2016)

[3] J. Song, H.T. Shen, J. Wang, Z. Huang, N. Sebe, J. Wang A distance-computation-free search scheme for binary code databases IEEE Transactions on Multimedia, 18 (3) (2016), pp. 484-495

[4] W. Zhou, M. Yang, X. Wang, H. Li, Y. Lin, Q. Tian Scalable feature matching by dual cascaded scalar quantization for image retrieval IEEE transactions on pattern analysis and machine intelligence, 38 (1)

[5] S. Zhang, Q. Huang, S. Jiang, W. Gao, Q. Tian Affective visualization and retrieval for music video IEEE Transactions on Multimedia, 12 (6) (2010), pp. 510-522

[6] A.K. Rajagopal, R. Subramanian, E. Ricci, R.L. Vieriu, O. Lanz, N. Sebe, et al. Exploring transfer learning approaches for head pose classification from multi-view surveillance images International Journal of Computer Vision, 109 (1-2) (2014), pp. 146-167

[7] Y. Yan, E. Ricci, R. Subramanian, G. Liu, O. Lanz, N. Sebe A multi-task learning framework for head pose estimation under target motion IEEE transactions on pattern analysis and machine intelligence, 38 (6) (2016), pp. 1070-1083

[8] B.F. Qaqish, J.J. OBrien, J.C. Hibbard, K.J. Clowers Accelerating high dimensional clustering with lossless data reduction Bioinformatics (2017), p. btx328

[9] Z. Deng, K.-S. Choi, Y. Jiang, J. Wang, S. Wang A survey on soft subspace clustering Information Sciences, 348 (2016), pp. 84-106

[10] O. Limwattanapibool, S. Arch-int Determination of the appropriate parameters for k-means clustering using selection of region clusters based on density dbscan (srcd-dbscan) Expert Systems (2017)

[11] L. Bai, X. Cheng, J. Liang, H. Shen, Y. Guo Fast density clustering strategies based on the k-means algorithm Pattern Recognition, 71 (2017), pp. 375-386

[12] N.A. Yousri, M.S. Kamel, M.A. Ismail A distance-relatedness dynamic model for clustering high dimensional data of arbitrary shapes and densities Pattern Recognition, 42 (7) (2009), pp. 1193-1209

[13] C. Zhong, D. Miao, R. Wang A graph-theoretical clustering method based on two rounds of minimum spanning trees Pattern Recognition, 43 (3) (2010), pp. 752-766

[14] M. Ester, H.-P. Kriegel, J. Sander Algorithms and applications for spatial data mining Geographic Data Mining and Knowledge Discovery, 5 (6) (2001)

[15] W.A. Barbakh, Y. Wu, C. Fyfe Non-standard parameter adaptation for exploratory data analysis 249, Springer (2009)

[16] J. Han, J. Pei, M. Kamber Data mining: concepts and techniques Elsevier (2011)

[17] J. Hou, W. Liu, E. Xu, H. Cui Towards parameter-independent data clustering and image segmentation Pattern Recognition, 60 (2016), pp. 25-36

[18] S. Mitra, P.P. Kundu Satellite image segmentation with shadowed c -means Information Sciences, 181 (17) (2011), pp. 3601-3613

[19] S. Das, S. Sil Kernel-induced fuzzy clustering of image pixels with an improved differential evolution algorithm Information Sciences An International Journal, 180 (8) (2010), pp. 1237-1256

[20] Y.C. Song, H.D. Meng, M.J. OGrady, G.M.P. OHare The application of cluster analysis in geophysical data interpretation Computational Geosciences, 14 (2) (2010), pp. 263-271

[21] A. Ghosh, N.S. Mishra, S. Ghosh Fuzzy clustering algorithms for unsupervised change detection in remote sensing images Information Sciences, 181 (4) (2011), pp. 699-715

[22] Y.J. Wang, H.S. Lee A clustering method to identify representative financial ratios Information Sciences, 178 (4) (2008), pp. 1087-1097

[23] J. Li, K. Wang, L. Xu Chameleon based on clustering feature tree and its application in customer segmentation Annals of Operations Research, 168 (1) (2009), pp. 225-245

[24] Q. Bsoul, J. Salim, L.Q. Zakaria An intelligent document clustering approach to detect crime patterns Procedia Technology, 11 (1) (2013), pp. 1181-1187

[25] C.-W. Huang, K.-P. Lin, M.-C. Wu, K.-C. Hung, G.-S. Liu, C.-H. Jen Intuitionistic fuzzy c-means clustering algorithm with neighborhood attraction in segmenting medical image Soft Computing, 19 (2) (2015), pp. 459-470

[26] V.P. Ananthi, P. Balasubramaniam, T. Kalaiselvi A new fuzzy clustering algorithm for the segmentation of brain tumor Soft Computing (2015), pp. 1-21

[27] R. Chinchuluun, W.S. Lee, J. Bhorania, P.M. Pardalos Clustering and Classification Algorithms in Food and Agricultural Applications: A Survey Springer US (2009)

[28] A. Hatamlou Black hole: A new heuristic optimization approach for data clustering Information Sciences, 222 (3) (2013), pp. 175-184

[29] J.G. Lee, J. Han, K.Y. Whang Trajectory clustering: a partition-and-group framework ACM SIGMOD International Conference on Management of Data (2007), pp. 593-604

[30] X.T. Yuan, B.G. Hu, R. He Agglomerative mean-shift clustering IEEE Transactions on Knowledge and Data Engineering, 24 (2) (2012), pp. 209-219

[31] W.-Y. Chen, Y. Song, H. Bai, C.-J. Lin, E.Y. Chang Parallel spectral clustering in distributed systems IEEE transactions on pattern analysis and machine intelligence, 33 (3) (2011), pp. 568-586

[32] S. Mitra A parallel clustering technique for the vehicle routing problem with split deliveries and pickups Journal of the Operational Research Society, 59 (11) (2008), pp. 1532-1546

[33] M. Ester, H.-P. Kriegel, J. Sander, X. Xu A density-based algorithm for discovering clusters in large spatial databases with noise. Kdd, 96 (1996), pp. 226-231

[34] B. Borah, D. Bhattacharyya An improved sampling-based dbscan for large spatial databases Intelligent Sensing and Information Processing, 2004. Proceedings of International Conference on, IEEE (2004), pp. 92-96

[35] C. Ruiz, M. Spiliopoulou, E. Menasalvas C-dbscan: Density-based clustering with constraints International Workshop on Rough Sets, Fuzzy Sets, Data Mining, and Granular-Soft Computing, Springer (2007), pp. 216-223

[36] Y. He, H. Tan, W. Luo, H. Mao, D. Ma, S. Feng, J. Fan Mr-dbscan: an efficient parallel density-based clustering algorithm using mapreduce Parallel and Distributed Systems (ICPADS), 2011 IEEE 17th International Conference on, IEEE (2011), pp. 473-480

[37] K.M. Kumar, A.R.M. Reddy A fast dbscan clustering algorithm by accelerating neighbor searching using groups method Pattern Recognition, 58 (2016), pp. 39-48

[38] A. Tramacere, C. Vecchio γ-ray dbscan: a clustering algorithm applied to fermi-lat γ-ray data-i. detection performances with real and simulated data Astronomy & Astrophysics, 549 (2013), p. A138

[39] S.T. Mai, S. Goebl, C. Plant A similarity model and segmentation algorithm for white matter fiber tracts 2012 IEEE 12th International Conference on Data Mining, IEEE (2012), pp. 1014-1019

[40] J. Gan, Y. Tao Dbscan revisited: Mis-claim, un-fixability, and approximation Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, ACM (2015), pp. 519-530

[41] J. Wang, Z. Deng, K.-S. Choi, Y. Jiang, X. Luo, F.-L. Chung, S. Wang Distance metric learning for soft subspace clustering in composite kernel space Pattern Recognition, 52 (2016), pp. 113-134

[42] P. Qian, Y. Jiang, Z. Deng, L. Hu, S. Sun, S. Wang, R.F. Muzic Cluster prototypes and fuzzy memberships jointly leveraged cross-domain maximum entropy clustering IEEE transactions on cybernetics, 46 (1) (2016), pp. 181-193

[43] Z. Deng, Y. Jiang, F.-L. Chung, H. Ishibuchi, K.-S. Choi, S. Wang Transfer prototype-based fuzzy clustering IEEE Transactions on Fuzzy Systems, 24 (5) (2016), pp. 1210-1232

[44] Z. Deng, K.-S. Choi, F.-L. Chung, S. Wang Enhanced soft subspace clustering integrating within-cluster and between-cluster information Pattern Recognition, 43 (3) (2010), pp. 767-781

[45] A. Gunawan A faster algorithm for DBSCAN, Masters thesis, Technische University Eindhoven (2013) Ph.D. thesis

[46] P. Viswanath, V.S. Babu Rough-dbscan: A fast hybrid density based clustering method for large data sets Pattern Recognition Letters, 30 (16) (2009), pp. 1477-1488

[47] D. Birant, A. Kut St-dbscan: An algorithm for clustering spatial–temporal data Data & Knowledge Engineering, 60 (1) (2007), pp. 208-221

[48] S. Mahran, K. Mahar Using grid for accelerating density-based clustering IEEE International Conference on Computer and Information Technology (2008), pp. 35-40

[49] C. Xiaoyun, M. Yufang, Z. Yan, W. Ping Gmdbscan: Multi-density dbscan cluster based on grid IEEE International Conference on E-Business Engineering (2008), pp. 780-783

[50] O. Uncu, W.A. Gruver, D.B. Kotak, D. Sabaz Gridbscan: Grid density-based spatial clustering of applications with noise IEEE International Conference on Systems, Man and Cybernetics (2006), pp. 2976-2981

[51] L. Zhang, Z. Xu, F. Si Gcmddbscan: Multi-density dbscan based on grid and contribution IEEE International Conference on Dependable, Autonomic and Secure Computing (2013), pp. 502-507

[52] H. Jegou, M. Douze, C. Schmid Product quantization for nearest neighbor search IEEE transactions on pattern analysis and machine intelligence, 33 (1) (2011), pp. 117-128

[53] A. Andoni, P. Indyk Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions Commun. ACM, 51 (1) (2008), p. 117C122

[54] D.G.L. Marius Muja Scalable nearest neighbor algorithms for high dimensional data IEEE transactions on pattern analysis and machine intelligence, 36 (11) (2014), pp. 2227-2240

[55] K. Buza Feedback prediction for blogs Data analysis, machine learning and knowledge discovery, Springer (2014), pp. 145-152

[56] K. Ricanek, T. Tesafaye Morph: a longitudinal image database of normal adult age-progression International Conference on Automatic Face and Gesture Recognition (2006), pp. 341-345

[57] G. Karypis, E.-H. Han, V. Kumar Chameleon: Hierarchical clustering using dynamic modeling Computer, 32 (8) (1999), pp. 68-75

[58] A. Gionis, H. Mannila, P. Tsaparas Clustering aggregation ACM Transactions on Knowledge Discovery from Data (TKDD), 1 (1) (2007), p. 4

[59] S. Maurus, C. Plant Skinny-dip: Clustering in a sea of noise SIGKDD, ACM (2016), pp. 1055-1064

[60] P.J. Huber Robust statistics International Encyclopedia of Statistical Science, Springer (2011), pp. 1248-1251

[61] Y. Chen, S. Tang, L. Zhou, C. Wang, J. Du, T. Wang, S. Pei Decentralized clustering by finding loose and distributed density cores Information Sciences, 433-434 (2018), pp. 510-526 ArticleDownload PDF

[62] Y. Chen, S. Tang, S. Pei, C. Wang, J. Du, N. Xiong Dheat: A density heat-based algorithm for clustering with effective radius IEEE Transactions on Systems, Man, and Cybernetics: Systems, 48 (2018), pp. 649-660

[63] H.W. Kuhn The hungarian method for the assignment problem Naval research logistics quarterly, 2 (1-2) (1955), pp. 83-97

Repository Staff Only: item control page