Koochemeshkian, Pantea (2024) Advancements in Topic Modeling: Integrating Bi-Directional Recurrent Attentional Models, Neural Embeddings, and Flexible Distributions. PhD thesis, Concordia University.
Text (application/pdf), 2MB: Koochemeshkian_PhD_F2024.pdf (Accepted Version). Available under License Spectrum Terms of Access.
Abstract
A primary objective in natural language processing is the classification of texts into discrete categories. Topic models and mixture models are indispensable tools for this task, as both learn patterns from data in an unsupervised manner. Several extensions to established topic modeling frameworks are introduced, incorporating more flexible priors and advanced inference methods to enhance performance in text document analysis. The Multinomial Principal Component Analysis (MPCA) framework, a Dirichlet-based model, is extended by integrating generalized Dirichlet (GD) and Beta-Liouville (BL) distributions, resulting in the GDMPCA and BLMPCA models. These priors address limitations of the Dirichlet prior, such as its independence assumptions among components and its restrictive covariance structure. Efficiency is further improved by implementing variational Bayesian inference and collapsed Gibbs sampling for fast and accurate parameter estimation.
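For reference, the two priors just mentioned have standard closed forms; the sketch below uses the usual Connor–Mosimann parameterization of the generalized Dirichlet and the common Beta-Liouville parameterization, which may differ in notation from the thesis itself.

```latex
% Generalized Dirichlet (Connor--Mosimann) density on the simplex,
% \theta_k \ge 0, \sum_k \theta_k \le 1:
p(\boldsymbol{\theta} \mid \boldsymbol{\alpha}, \boldsymbol{\beta})
  = \prod_{k=1}^{K} \frac{\Gamma(\alpha_k + \beta_k)}{\Gamma(\alpha_k)\,\Gamma(\beta_k)}\,
    \theta_k^{\alpha_k - 1}
    \Bigl(1 - \textstyle\sum_{j=1}^{k} \theta_j\Bigr)^{\gamma_k},
\qquad
\gamma_k =
  \begin{cases}
    \beta_k - \alpha_{k+1} - \beta_{k+1}, & k < K,\\
    \beta_K - 1, & k = K.
  \end{cases}

% Beta-Liouville density with parameters (\alpha_1,\dots,\alpha_D,\alpha,\beta):
p(\boldsymbol{\theta} \mid \boldsymbol{\alpha}, \alpha, \beta)
  = \frac{\Gamma\bigl(\sum_{d=1}^{D} \alpha_d\bigr)\,\Gamma(\alpha + \beta)}
         {\Gamma(\alpha)\,\Gamma(\beta)}
    \Bigl(\textstyle\sum_{d=1}^{D} \theta_d\Bigr)^{\alpha - \sum_d \alpha_d}
    \Bigl(1 - \textstyle\sum_{d=1}^{D} \theta_d\Bigr)^{\beta - 1}
    \prod_{d=1}^{D} \frac{\theta_d^{\alpha_d - 1}}{\Gamma(\alpha_d)}
```

Setting \(\beta_k = \alpha_{k+1} + \beta_{k+1}\) recovers the ordinary Dirichlet from the GD, which is why the GD is strictly more flexible; in particular, both families permit a richer covariance structure than the Dirichlet's strictly negative correlations.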
The Bi-Directional Recurrent Attentional Topic Model (bi-RATM) is likewise enhanced with GD and BL distributions, leading to the GD-bi-RATM and BL-bi-RATM models. These models leverage attention mechanisms to capture relationships between sentences, offering greater flexibility and improved performance in document embedding tasks.
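The abstract does not spell out the attention mechanism, so the snippet below shows only the generic scaled dot-product form of attention over sentence vectors, an illustrative stand-in rather than the exact bi-RATM formulation; the embedding dimensions and toy data are invented for the example.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Generic scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # (n_query, n_key) similarities
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax
    return weights @ V                             # attention-weighted combination

# Hypothetical example: 5 sentence embeddings of dimension 8;
# each sentence attends over all sentences in the document.
rng = np.random.default_rng(0)
sent_emb = rng.normal(size=(5, 8))
context = scaled_dot_product_attention(sent_emb, sent_emb, sent_emb)
print(context.shape)  # (5, 8)
```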
The Dirichlet Multinomial Regression (DMR) and deep Dirichlet Multinomial Regression (dDMR) approaches are extended by incorporating GD and BL distributions, addressing their limitations in handling complex data structures and their tendency to overfit; collapsed Gibbs sampling provides efficient parameter inference. Experimental results on benchmark datasets demonstrate improved topic modeling performance, particularly on complex data structures and in reducing overfitting.
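For orientation, the count-ratio update at the heart of collapsed Gibbs sampling for a plain Dirichlet-based LDA-style model is sketched below. This is a minimal generic illustration (the function name, hyperparameters, and toy data are invented for the example), not the thesis's GD/BL-specific samplers, which, roughly speaking, replace the Dirichlet terms with the corresponding GD or BL predictive quantities.

```python
import numpy as np

def collapsed_gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampler for plain LDA with symmetric Dirichlet priors.

    docs: list of word-id lists; V: vocabulary size; K: number of topics.
    Each assignment is resampled from the standard collapsed conditional
    p(z = k | rest) ∝ (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta).
    """
    rng = np.random.default_rng(seed)
    D = len(docs)
    n_dk = np.zeros((D, K))          # topic counts per document
    n_kw = np.zeros((K, V))          # word counts per topic
    n_k = np.zeros(K)                # total words per topic
    z = [rng.integers(K, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):   # initialize counts from random assignments
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]          # remove current assignment from counts
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k          # record the newly sampled assignment
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    return n_dk, n_kw

# Toy usage: 3 tiny documents over a 6-word vocabulary, 2 topics.
docs = [[0, 1, 0, 2], [3, 4, 3, 5], [0, 1, 4, 5]]
n_dk, n_kw = collapsed_gibbs_lda(docs, V=6, K=2, iters=50)
print(n_dk)
```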
Novel approaches are developed by integrating embeddings derived from BERTopic with the multi-grain clustering topic model (MGCTM). Recognizing the hierarchical, multi-scale nature of topics, these methods use MGCTM to capture topic structure at multiple levels of granularity, with GD and BL distributions enhancing the model's expressiveness and flexibility. Experiments on various datasets show superior topic coherence and granularity compared to state-of-the-art methods.
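As a rough sketch of the embedding stage of such a pipeline, the snippet below encodes documents with a pretrained sentence-embedding model, assuming the sentence-transformers package is available; plain k-means stands in for MGCTM purely for illustration, since the MGCTM step itself is not shown in the abstract, and the model name and toy documents are illustrative choices.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

docs = [
    "Stock markets rallied after the central bank's announcement.",
    "The striker scored twice in the final minutes of the match.",
    "New transformer architectures keep improving language modeling.",
]

# BERTopic-style first stage: encode documents with a pretrained
# sentence-embedding model (model name is an illustrative choice).
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(docs)

# Stand-in for the multi-grain clustering stage: ordinary k-means.
# The thesis instead feeds such embeddings into MGCTM with GD/BL priors.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(embeddings)
print(labels)
```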
Overall, the proposed models exhibit improved interpretability and effectiveness in various natural language processing and machine learning applications, showcasing the potential of combining neural embeddings with advanced probabilistic modeling techniques.
Divisions: Concordia University > Gina Cody School of Engineering and Computer Science > Concordia Institute for Information Systems Engineering
Item Type: Thesis (PhD)
Authors: Koochemeshkian, Pantea
Institution: Concordia University
Degree Name: Ph.D.
Program: Information and Systems Engineering
Date: 20 June 2024
Thesis Supervisor(s): Bouguila, Nizar
ID Code: 994607
Deposited By: Pantea Koochemeshkian
Deposited On: 24 Oct 2024 17:59
Last Modified: 24 Oct 2024 17:59