
Advancements in Topic Modeling: Integrating Bi-Directional Recurrent Attentional Models, Neural Embeddings, and Flexible Distributions

Koochemeshkian, Pantea (2024) Advancements in Topic Modeling: Integrating Bi-Directional Recurrent Attentional Models, Neural Embeddings, and Flexible Distributions. PhD thesis, Concordia University.

Text (application/pdf)
Koochemeshkian_PhD_F2024.pdf - Accepted Version
Available under License Spectrum Terms of Access.
2MB

Abstract

A primary objective in natural language processing is the classification of texts into discrete categories. Topic models and mixture models are indispensable tools for this task, as both learn patterns from data in an unsupervised manner. Several extensions to established topic modeling frameworks are introduced, incorporating more flexible priors and advanced inference methods to enhance performance in text document analysis. The Multinomial Principal Component Analysis (MPCA) framework, a Dirichlet-based model, is extended by integrating generalized Dirichlet (GD) and Beta-Liouville (BL) distributions, resulting in the GDMPCA and BLMPCA models. These priors address limitations of the Dirichlet prior, such as its assumption of independence between components and its restricted covariance structure. Variational Bayesian inference and collapsed Gibbs sampling are implemented for fast and accurate parameter estimation.
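
To illustrate the added flexibility, the sketch below draws a sample from a generalized Dirichlet distribution via its stick-breaking (Connor-Mosimann) construction. It is a generic textbook construction in plain NumPy, not code from the thesis; choosing beta[i] = alpha[i+1] + beta[i+1] recovers an ordinary Dirichlet, so the GD strictly generalizes it.

```python
import numpy as np

def sample_generalized_dirichlet(alpha, beta, seed=None):
    """Draw one point on the simplex from a generalized Dirichlet (GD)
    distribution via its stick-breaking (Connor-Mosimann) construction.

    Unlike the Dirichlet, each of the K-1 Beta(alpha[i], beta[i]) factors
    has its own second shape parameter, which relaxes the Dirichlet's
    rigid covariance structure between components.
    """
    rng = np.random.default_rng(seed)
    alpha = np.asarray(alpha, dtype=float)   # K-1 shape parameters
    beta = np.asarray(beta, dtype=float)     # K-1 shape parameters
    nu = rng.beta(alpha, beta)               # K-1 independent Beta draws
    theta = np.empty(alpha.size + 1)
    remaining = 1.0
    for i, v in enumerate(nu):
        theta[i] = v * remaining             # take a fraction of the remaining stick
        remaining *= 1.0 - v
    theta[-1] = remaining                    # last component gets the remainder
    return theta                             # non-negative, sums to 1

# Toy example: one 4-dimensional GD draw.
print(sample_generalized_dirichlet([2.0, 3.0, 1.5], [4.0, 2.0, 1.0], seed=0))
```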

The Bi-Directional Recurrent Attentional Topic Model (bi-RATM) is enhanced by incorporating GD and BL distributions, leading to the GD-bi-RATM and BL-bi-RATM models. These models leverage attention mechanisms to capture relationships between sentences, offering greater flexibility and improved performance in document embedding tasks.
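
The attention idea these models build on can be shown with a generic scaled dot-product sketch over sentence embeddings; the function and toy data below are illustrative only and do not reproduce the bi-RATM architecture.

```python
import numpy as np

def sentence_attention(sentences, query):
    """Scaled dot-product attention over sentence vectors.

    `sentences` is an (n, d) array of sentence embeddings and `query` a
    (d,) context vector (e.g., a hidden state from a bi-directional RNN).
    Returns the attention weights and the attended summary vector.
    """
    d = sentences.shape[1]
    scores = sentences @ query / np.sqrt(d)  # similarity of each sentence to the query
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()                 # distribution over sentences
    context = weights @ sentences            # weighted combination of sentences
    return weights, context

rng = np.random.default_rng(0)
S = rng.normal(size=(5, 16))                 # five 16-dimensional sentence embeddings (toy)
w, ctx = sentence_attention(S, S[2])         # attend from the third sentence
print(np.round(w, 3))                        # the third sentence gets the largest weight
```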

The Dirichlet Multinomial Regression (DMR) and deep Dirichlet Multinomial Regression (dDMR) approaches are likewise extended with GD and BL distributions. This integration addresses limitations in handling complex data structures and susceptibility to overfitting, with collapsed Gibbs sampling providing an efficient method for parameter inference. Experimental results on benchmark datasets demonstrate enhanced topic modeling performance, particularly on complex data structures and in reducing overfitting.
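
For context, standard DMR conditions each document's Dirichlet concentration parameters on its metadata through a log-linear model, alpha[d, k] = exp(x_d . lambda_k). The sketch below shows that baseline formulation only; the GD/BL replacements proposed in the thesis are not reproduced here.

```python
import numpy as np

def dmr_document_priors(features, lam):
    """Document-specific topic-prior concentrations as in Dirichlet
    Multinomial Regression: alpha[d, k] = exp(features[d] . lam[k]),
    so documents with similar metadata share similar topic priors.
    """
    return np.exp(features @ lam.T)        # (D, K), strictly positive

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 4))                # 3 documents, 4 metadata features (toy)
Lam = rng.normal(size=(5, 4))              # 5 topics, one weight vector per topic
alpha = dmr_document_priors(X, Lam)
print(alpha.shape, bool(alpha.min() > 0))  # (3, 5) True
```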

Novel approaches are developed by integrating embeddings derived from BERTopic with the multi-grain clustering topic model (MGCTM). Recognizing the hierarchical, multi-scale nature of topics, these methods use MGCTM to capture topic structures at multiple levels of granularity. Incorporating GD and BL distributions further enhances the expressiveness and flexibility of MGCTM. Experiments on various datasets show superior topic coherence and granularity compared to state-of-the-art methods.
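
A minimal sketch of the two ingredients combined here: transformer-based document embeddings of the kind BERTopic builds on, clustered at two levels of granularity. Plain k-means stands in for MGCTM purely for illustration, and "all-MiniLM-L6-v2" is a common sentence-transformers default, not necessarily the embedding model used in the thesis.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

docs = [
    "neural topic models for document analysis",
    "variational inference for mixture models",
    "soccer world cup qualifying results",
    "tennis grand slam tournament winners",
]

# Transformer embeddings, then clustering at a coarse and a fine granularity.
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)
coarse = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
fine = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(embeddings)

for doc, c, f in zip(docs, coarse, fine):
    print(f"coarse={c} fine={f} :: {doc}")
```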

Overall, the proposed models exhibit improved interpretability and effectiveness in various natural language processing and machine learning applications, showcasing the potential of combining neural embeddings with advanced probabilistic modeling techniques.

Divisions: Concordia University > Gina Cody School of Engineering and Computer Science > Concordia Institute for Information Systems Engineering
Item Type: Thesis (PhD)
Authors: Koochemeshkian, Pantea
Institution: Concordia University
Degree Name: Ph.D.
Program: Information and Systems Engineering
Date: 20 June 2024
Thesis Supervisor(s): Bouguila, Nizar
ID Code: 994607
Deposited By: Pantea Koochemeshkian
Deposited On: 24 Oct 2024 17:59
Last Modified: 24 Oct 2024 17:59
All items in Spectrum are protected by copyright, with all rights reserved. The use of items is governed by Spectrum's terms of access.
