Positive Data Clustering based on Generalized Inverted Dirichlet Mixture Model

Al Mashrgy, Mohamed (2015) Positive Data Clustering based on Generalized Inverted Dirichlet Mixture Model. PhD thesis, Concordia University.

Text (application/pdf)
Al_Mashrgy_PhD_F2015.pdf - Accepted Version

2MB

Abstract

Recent advances in processing and networking capabilities of computers have caused an accumulation
of immense amounts of multimodal multimedia data (image, text, video). These data
are generally presented as high-dimensional vectors of features. The availability of these high-dimensional
data sets has provided the input to a large variety of statistical learning applications
including clustering, classification, feature selection, outlier detection and density estimation. In
this context, a finite mixture offers a formal approach to clustering and a powerful tool to tackle
the problem of data modeling. A mixture model assumes that the data is generated by a set of
parametric probability distributions. The main learning process of a mixture model consists of the
following two parts: parameter estimation and model selection (estimating the number of components).
In addition, other issues may be considered during the learning process of mixture models,
such as a) feature selection and b) outlier detection. The main objective of this thesis is to
work with different kinds of estimation criteria and to incorporate those challenges into a single
framework.
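For reference, the finite mixture assumption described above is commonly written as follows. The notation is an assumption made here for illustration (M components, mixing weights \pi_j, component parameters \theta_j) and is not taken from the thesis itself:

```latex
p(\mathbf{X} \mid \Theta) = \sum_{j=1}^{M} \pi_j \, p(\mathbf{X} \mid \theta_j),
\qquad \pi_j \ge 0, \quad \sum_{j=1}^{M} \pi_j = 1 .
```

In these terms, parameter estimation concerns \Theta = \{\pi_j, \theta_j\}, and model selection concerns the choice of M.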
The first contribution of this thesis is to propose a statistical framework which can tackle the problem
of parameter estimation, model selection, feature selection, and outlier rejection in a unified
model. We propose to use feature saliency and introduce an expectation-maximization (EM) algorithm
for the estimation of the Generalized Inverted Dirichlet (GID) mixture model. By using
the Minimum Message Length (MML), we can identify how much each feature contributes to
our model as well as determine the number of components. The presence of outliers is an added
challenge and is handled by incorporating an auxiliary outlier component, to which we associate a uniform density. Experimental results on synthetic data, as well as real-world applications involving
visual scenes and object classification, indicate that the proposed approach is promising,
even though a low-dimensional representation of the data was applied. In addition, they show
the importance of embedding an outlier component in the proposed model. EM learning suffers
from significant drawbacks. In order to overcome those drawbacks, a learning approach using a
Bayesian framework is proposed as our second contribution. This learning is based on estimating
the parameter posteriors while taking into account prior knowledge about these parameters.
The posterior distribution of each parameter in the model is computed using Markov
chain Monte Carlo (MCMC) simulation methods, namely Gibbs sampling and the
Metropolis-Hastings method. The Bayesian Information Criterion (BIC) was used for model selection. The
proposed model was validated on object classification and forgery detection applications. For the
first two contributions, we developed a finite GID mixture. However, in the third contribution,
we propose an infinite GID mixture model. The proposed model simultaneously tackles the clustering
and feature selection problems. The proposed learning model is based on Gibbs sampling.
The effectiveness of the proposed method is shown using an image categorization application. Our
last contribution in this thesis is another fully Bayesian approach for a finite GID mixture learning
model using the Reversible Jump Markov Chain Monte Carlo (RJMCMC) technique. The
proposed algorithm allows for the simultaneous handling of model selection and parameter estimation for high-dimensional data. The merits of this approach are investigated using synthetic
data and data generated from a challenging application, namely object detection.
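The idea of an auxiliary uniform outlier component, mentioned in the abstract, can be illustrated with a minimal sketch of the E-step (posterior membership probabilities) of such a mixture. This is not the thesis's algorithm: independent Gamma densities are used purely as a stand-in for the GID component density, the uniform support (0, upper[d]) per feature is an assumption, and the names gamma_logpdf, e_step, and upper are hypothetical.

```python
import math


def gamma_logpdf(x, shape, scale):
    """Log-density of Gamma(shape, scale) at x > 0 (stand-in for a GID component)."""
    return ((shape - 1.0) * math.log(x) - x / scale
            - shape * math.log(scale) - math.lgamma(shape))


def e_step(data, weights, params, outlier_weight, upper):
    """Posterior probabilities of each regular component plus the outlier component.

    data:           list of positive feature vectors
    weights:        mixing weights of the regular components
    params:         per component, one (shape, scale) pair per feature
    outlier_weight: mixing weight of the uniform outlier component
    upper:          assumed per-feature upper bounds of the uniform support
    weights and outlier_weight are assumed to sum to 1.
    """
    # Log-density of a uniform distribution over the box (0, upper[0]) x ...
    log_uniform = -sum(math.log(u) for u in upper)
    result = []
    for x in data:
        logs = [
            math.log(w) + sum(gamma_logpdf(xd, a, s) for xd, (a, s) in zip(x, comp))
            for w, comp in zip(weights, params)
        ]
        logs.append(math.log(outlier_weight) + log_uniform)
        # Normalize in log space for numerical stability.
        m = max(logs)
        unnorm = [math.exp(v - m) for v in logs]
        total = sum(unnorm)
        result.append([u / total for u in unnorm])
    return result


data = [[1.0, 2.0], [1.2, 1.8], [40.0, 45.0]]
resp = e_step(
    data,
    weights=[0.95],
    params=[[(2.0, 1.0), (2.0, 1.0)]],
    outlier_weight=0.05,
    upper=[50.0, 50.0],
)
for x, r in zip(data, resp):
    print(x, [round(p, 4) for p in r])
```

In this toy setting the last entry of each row is the posterior probability of the outlier component. A point far from the bulk of the data receives almost all of its mass there, so it contributes little to the parameter updates of the regular components.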

Divisions:	Concordia University > Gina Cody School of Engineering and Computer Science > Electrical and Computer Engineering
Item Type:	Thesis (PhD)
Authors:	Al Mashrgy, Mohamed
Institution:	Concordia University
Degree Name:	Ph. D.
Program:	Electrical and Computer Engineering
Date:	28 May 2015
Thesis Supervisor(s):	Bouguila, Nizar
ID Code:	980105
Deposited By:	MOHAMED ALMASHRGY
Deposited On:	27 Oct 2015 19:52
Last Modified:	18 Jan 2018 17:50
