
Bayesian Parameter Estimation of Probabilistic Models for Information Retrieval and Clustering in Discrete Data Spaces


Salmanzade Yazdi, Sahar (2023) Bayesian Parameter Estimation of Probabilistic Models for Information Retrieval and Clustering in Discrete Data Spaces. Masters thesis, Concordia University.

Text (application/pdf)
Salmanzade Yazdi_MASc_S2024.pdf - Accepted Version
Available under License Spectrum Terms of Access.
1MB

Abstract

Vast amounts of data are generated today, creating a critical need to model data effectively so that it can be analyzed thoroughly and meaningful patterns extracted. This need is especially acute in natural language processing (NLP) and text mining, whose scope spans tasks such as document retrieval, spam email filtering, smart assistant applications, and sentiment analysis. In this context, various Bayesian models have been developed to model data and extract essential information by considering latent topics. These models, grounded in probabilistic graphical models such as Bayesian networks, capture the probabilistic dependencies between variables. Their ability to incorporate evidence from prior user knowledge significantly enhances retrieval performance. Furthermore, Bayesian network models are more effective and more general than classical information retrieval models such as the Boolean, vector, and probabilistic models, which makes them valuable approaches in information retrieval. Topic modeling, a valuable technique in text mining, uncovers hidden thematic structures within document collections, identifies clusters of "topics" or co-occurring words, and aids in understanding the underlying themes and patterns in data. This unsupervised classification of documents, akin to clustering of numeric data, allows natural groups of documents to be discovered even in the absence of predefined topics.
However, significant challenges persist, including handling queries that are not present in the data collection, sparsity within datasets (especially in the age of big data), and correlations between observations. This thesis proposes innovative Bayesian extensions for data modeling that use the Generalized Dirichlet distribution and the Beta-Liouville distribution as prior probability distributions to incorporate new queries into the topic space. Furthermore, these priors are integrated into a probabilistic clustering-projection model to evaluate their joint impact on clustering and projection. Lastly, the Generalized Dirichlet and Beta-Liouville distributions are advocated as priors for addressing the challenges of data sparsity.
The selection of a suitable prior is crucial in Bayesian data modeling, and these distributions are explored for their ability to model various kinds of non-Gaussian data and to overcome the limited covariance structure of other distributions such as the Dirichlet distribution. Once the priors are chosen, the next step is to estimate the optimal parameters of the distributions and the model. An iterative parameter estimation method based on the Expectation-Maximization (EM) algorithm is developed to maximize the data likelihood. The simplicity of the proposed iterative algorithms allows these models to successfully handle real-time data, making them applicable across a broad range of practical scenarios.

Divisions: Concordia University > Gina Cody School of Engineering and Computer Science > Concordia Institute for Information Systems Engineering
Item Type: Thesis (Masters)
Authors: Salmanzade Yazdi, Sahar
Institution: Concordia University
Degree Name: M.A. Sc.
Program: Quality Systems Engineering
Date: 20 December 2023
Thesis Supervisor(s): Bouguila, Nizar
ID Code: 993364
Deposited By: Sahar Salmanzade Yazdi
Deposited On: 05 Jun 2024 16:53
Last Modified: 05 Jun 2024 16:53
All items in Spectrum are protected by copyright, with all rights reserved. The use of items is governed by Spectrum's terms of access.
