
Embedded spherical probabilistic modeling for topic discovery and text representation learning

Ennajari, Hafsa (2023) Embedded spherical probabilistic modeling for topic discovery and text representation learning. PhD thesis, Concordia University.

Text (application/pdf)
Ennajari_PhD_F2023.pdf - Accepted Version
Restricted to Repository staff only until 14 September 2025.
Available under License Spectrum Terms of Access.
29MB

Abstract

Every day, large amounts of text data are generated on the web. Taking advantage of such data requires good methods of retrieval, exploration, and analysis to extract hidden knowledge from these voluminous unstructured texts. In this context, probabilistic topic modeling is regarded as an effective text mining technique that uncovers the main topics in an unlabeled set of documents. Topic models have been used successfully to reveal hidden topics in various domains, e.g., marketing, medicine, and political science. However, the topics inferred by conventional topic models are often unclear and hard to interpret, because these models do not account for semantic structures in language. Recently, several topic modeling approaches have been proposed that leverage external knowledge to enhance the quality of the learned topics, but they still assume a Multinomial or Gaussian document likelihood in Euclidean space, which often results in information loss and poor performance. In this thesis, we introduce a set of probabilistic embedded spherical topic models designed to address several challenges, including lack of topic interpretability, high dimensionality, and sparsity. Our approaches integrate knowledge graphs and word embeddings within a non-Euclidean curved space, namely the hypersphere, to enhance topic interpretability and generate discriminative text representations. The proposed models handle a wide range of scenarios, encompassing both unsupervised and supervised learning tasks. Experimental results demonstrate the effectiveness of the proposed algorithms in discovering coherent topics and learning high-quality text representations, which prove valuable for common Natural Language Processing (NLP) tasks across diverse benchmark datasets. These findings further highlight the advantages of modeling textual data on the surface of the unit hypersphere using directional distributions while incorporating word and knowledge graph embeddings.
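
As a purely illustrative sketch of what modeling text on the unit hypersphere with a directional distribution can look like, the Python snippet below L2-normalizes document vectors and assigns them to topics softly, using von Mises-Fisher (vMF) components. It is not the thesis's model, whose details are not given in this record. The choice of vMF, the shared concentration `kappa`, the equal topic priors, the toy vectors, and the function names `to_unit_sphere` and `vmf_responsibilities` are all assumptions made for the example.

```python
import numpy as np


def to_unit_sphere(vectors):
    """L2-normalize each row so it lies on the unit hypersphere."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    return vectors / np.clip(norms, 1e-12, None)


def vmf_responsibilities(doc_vectors, topic_directions, kappa=10.0):
    """Soft topic assignments under vMF components.

    Assumptions (for illustration only): every topic shares the same
    concentration kappa and has equal prior weight. The vMF normalizing
    constant is then identical across topics and cancels, so the
    posterior is a softmax over kappa * cosine similarity.
    """
    x = to_unit_sphere(doc_vectors)
    mu = to_unit_sphere(topic_directions)
    logits = kappa * (x @ mu.T)
    # Subtract the row-wise maximum for numerical stability.
    logits -= logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


# Hypothetical toy data: two 3-dimensional document vectors, two topics.
docs = np.array([[2.0, 0.1, 0.0], [0.0, 3.0, 0.2]])
topics = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
resp = vmf_responsibilities(docs, topics)
```

Because every vector is normalized, only its direction matters: the first toy document is assigned mostly to the first topic and the second to the second topic, regardless of the original vector lengths.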

Divisions: Concordia University > Gina Cody School of Engineering and Computer Science > Concordia Institute for Information Systems Engineering
Item Type: Thesis (PhD)
Authors: Ennajari, Hafsa
Institution: Concordia University
Degree Name: Ph. D.
Program: Information and Systems Engineering
Date: 8 September 2023
Thesis Supervisor(s): Bouguila, Nizar and Bentahar, Jamal
ID Code: 992949
Deposited By: Hafsa Ennajari
Deposited On: 16 Nov 2023 19:34
Last Modified: 16 Nov 2023 19:34
All items in Spectrum are protected by copyright, with all rights reserved. The use of items is governed by Spectrum's terms of access.
