
Extensions to the Latent Dirichlet Allocation Topic Model Using Flexible Priors


Ihou, Koffi Eddy (2020) Extensions to the Latent Dirichlet Allocation Topic Model Using Flexible Priors. PhD thesis, Concordia University.

Text (application/pdf): Ihou_PhD_S2021.pdf - Accepted Version (5MB)

Abstract

Intrinsically, topic models fix their likelihood functions to multinomial distributions because they operate on count data rather than Gaussian data. As a result, under the Bayesian paradigm their performance ultimately depends on the flexibility of the chosen prior distributions, in contrast to classical approaches such as PLSA (probabilistic latent semantic analysis), unigram, and mixture-of-unigrams models that use no prior information. The standard LDA (latent Dirichlet allocation) topic model uses a symmetric Dirichlet distribution as a conjugate prior, which has known limitations: its independence structure hinders performance on topic correlation, in particular when processing positively correlated data. Compared to classical ML (maximum likelihood) estimators, priors offer the further advantage of smoothing the multinomials, which enhances predictive topic models.
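
To make the independence limitation concrete: under a symmetric Dirichlet prior, any two topic proportions are necessarily negatively correlated, so the prior cannot encode positively correlated topics. This is a standard property of the Dirichlet distribution, not specific to this thesis:

\[
\theta \sim \mathrm{Dir}(\alpha, \dots, \alpha), \qquad
\mathrm{Cov}(\theta_i, \theta_j) \;=\; \frac{-1}{K^2 (K\alpha + 1)} \;<\; 0 \quad (i \neq j),
\]

where \(K\) is the number of topics. Flexible priors such as the GD and BL relax exactly this rigid covariance structure.
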
In this thesis, we propose a series of flexible priors, such as the generalized Dirichlet (GD) and the Beta-Liouville (BL), for our topic models within the collapsed representation, leading to much improved CVB (collapsed variational Bayes) update equations compared to those of the standard LDA. This is because the flexibility of these priors significantly improves the lower bounds in the corresponding CVB algorithms. We also demonstrate the robustness of the proposed CVB inferences when the BL and GD priors are used simultaneously in hybrid generative-discriminative models: the generative stage produces rich, heterogeneous topic features that the discriminative stage feeds to powerful classifiers such as SVMs (support vector machines), and we propose efficient probabilistic kernels to facilitate the classification of documents based on their topic signatures. In doing so, we implicitly cast topic modeling, an unsupervised learning method, into a supervised learning technique.
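
As an illustration of this hybrid pipeline (a minimal sketch, not the thesis's implementation: scikit-learn's standard LDA stands in for the BL/GD topic models, toy Poisson counts stand in for a real corpus, and a Bhattacharyya-style probability product kernel stands in for the proposed probabilistic kernels):

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy document-term counts with a crude two-class split (illustration only).
X = rng.poisson(1.0, size=(200, 500))
y = (X[:, :250].sum(axis=1) > X[:, 250:].sum(axis=1)).astype(int)
X_train, X_test, y_train = X[:150], X[150:], y[:150]

def bhattacharyya_kernel(A, B):
    # Probability product kernel K(p, q) = sum_k sqrt(p_k * q_k),
    # applied row-wise to topic-proportion vectors.
    return np.sqrt(A) @ np.sqrt(B).T

lda = LatentDirichletAllocation(n_components=20, random_state=0)
theta_train = lda.fit_transform(X_train)  # generative stage: topic signatures
theta_test = lda.transform(X_test)

svm = SVC(kernel="precomputed")           # discriminative stage
svm.fit(bhattacharyya_kernel(theta_train, theta_train), y_train)
y_pred = svm.predict(bhattacharyya_kernel(theta_test, theta_train))
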
Furthermore, because the CVB algorithm, despite its flexibility, is generally complex (it requires second-order Taylor expansions), we propose a much simpler and more tractable update equation using a MAP (maximum a posteriori) framework with the standard EM (expectation-maximization) algorithm. As most Bayesian posteriors are intractable for complex models, we ultimately propose MAP-LBLA (latent BL allocation), in which we characterize the contributions of asymmetric BL priors over the symmetric Dirichlet (Dir). Importantly, the proposed MAP technique offers a point estimate (the posterior mode) with a far more tractable solution, and we show that this point estimate is easier to implement than a full Bayesian analysis that integrates over the entire parameter space. The MAP also exhibits an implicit equivalence with the CVB, especially its zero-order approximation CVB0 and the stochastic version SCVB0. The proposed method enhances information retrieval performance in text document analysis.
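
For reference, the zero-order collapsed variational update (CVB0) for standard LDA, to which the MAP estimate is related, has the well-known form

\[
\gamma_{dnk} \;\propto\; \frac{N^{\neg dn}_{k w_{dn}} + \eta}{N^{\neg dn}_{k} + V\eta} \left( N^{\neg dn}_{dk} + \alpha_k \right),
\]

where \(\gamma_{dnk}\) is the responsibility of topic \(k\) for token \(w_{dn}\), the \(N^{\neg dn}\) are expected counts excluding that token, \(V\) is the vocabulary size, and \(\eta, \alpha_k\) are the Dirichlet hyperparameters; the thesis derives the analogous updates under the BL and GD priors.
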
We show that parametric topic models, being finite-dimensional methods, have a much smaller hypothesis space and generally suffer from the model selection problem. We therefore propose a Bayesian nonparametric (BNP) technique that uses the hierarchical Dirichlet process (HDP) as a conjugate prior on the document multinomial distributions, where the asymmetric BL serves as a diffuse (probability) base measure providing the global atoms (topics) shared among documents. The heterogeneity of the topic structure offers an alternative to model selection because the nonparametric topic model, being infinite-dimensional with a much larger hypothesis space, can prune out irrelevant topics based on their associated probability masses and retain only the most relevant ones.
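
In standard HDP notation, the hierarchical construction reads

\[
G_0 \mid \gamma, H \sim \mathrm{DP}(\gamma, H), \qquad
G_d \mid \alpha_0, G_0 \sim \mathrm{DP}(\alpha_0, G_0), \quad d = 1, \dots, D,
\]

where, in our model, the base measure \(H\) is the asymmetric BL; every document-level measure \(G_d\) then shares the global atoms (topics) of \(G_0\), and atoms with negligible mass are effectively pruned.
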
We also show that, for large-scale applications, stochastic optimization using natural gradients of the objective function performs well when both data and parameters are learned rapidly in an online (streaming) fashion. We use both predictive likelihood and perplexity as evaluation metrics to assess the robustness of the proposed topic models, since probability is ultimately how our Bayesian framework quantifies uncertainty. The flexibility of our prior distributions in the collapsed space improves inference for object categorization, and the MAP and HDP-LBLA topic models improve information retrieval while extending the standard LDA. Together, these two applications demonstrate the capability of enhancing a topic-model-based search engine.
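
For concreteness, held-out perplexity is conventionally defined as (lower is better)

\[
\mathrm{perplexity}(\mathcal{D}_{\mathrm{test}}) = \exp\!\left( -\frac{\sum_{d=1}^{D} \log p(\mathbf{w}_d)}{\sum_{d=1}^{D} N_d} \right),
\]

where \(N_d\) is the length of test document \(d\).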

Divisions: Concordia University > Gina Cody School of Engineering and Computer Science > Concordia Institute for Information Systems Engineering
Item Type: Thesis (PhD)
Authors: Ihou, Koffi Eddy
Institution: Concordia University
Degree Name: Ph.D.
Program: Information and Systems Engineering
Date: 23 November 2020
Thesis Supervisor(s): Bouguila, Nizar and Bouachir, Wassim
ID Code: 988097
Deposited By: Koffi Eddy Ihou
Deposited On: 29 Jun 2021 20:53
Last Modified: 29 Jun 2021 20:53
All items in Spectrum are protected by copyright, with all rights reserved. The use of items is governed by Spectrum's terms of access.
