
Extensions to Cross-collection Topic Models with Parallel Inference and Differential Privacy using Flexible Priors


Luo, Zhiwen (2022) Extensions to Cross-collection Topic Models with Parallel Inference and Differential Privacy using Flexible Priors. Masters thesis, Concordia University.

Text (application/pdf): Luo_MA_F2022.pdf - Accepted Version (10MB)
Available under License Spectrum Terms of Access.

Abstract

Cross-collection topic models extend single-collection topic models such as Latent Dirichlet Allocation (LDA) to multiple collections. The purpose of cross-collection topic modelling is to model document-topic representations while revealing the similarities between topics and the differences among collections. The limitations of the Dirichlet prior have impeded the performance of state-of-the-art cross-collection topic models, motivating the introduction of more flexible priors.
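To see the limitation concretely: under a Dirichlet prior, the proportions of any two topics are always negatively correlated, so the prior cannot encode topics that tend to co-occur. This is a standard result about the Dirichlet distribution (not taken from the thesis), summarized in LaTeX:

\text{For } \theta \sim \mathrm{Dir}(\alpha_1,\dots,\alpha_K),\ \alpha_0 = \textstyle\sum_{k}\alpha_k:
\qquad \mathbb{E}[\theta_i] = \frac{\alpha_i}{\alpha_0}, \qquad
\operatorname{Cov}(\theta_i,\theta_j) = \frac{-\alpha_i\,\alpha_j}{\alpha_0^{2}(\alpha_0+1)} < 0 \quad (i \neq j).

The generalized Dirichlet and Beta-Liouville priors discussed below admit covariance structures that are not constrained to be negative.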

In this thesis, we first introduce a novel topic model, GPU-based cross-collection latent generalized Dirichlet allocation (ccLGDA), which explores the similarities and differences across multiple data collections by adopting the generalized Dirichlet (GD) distribution to overcome the limitations of the Dirichlet prior in conventional topic models while improving computational efficiency. As a more flexible prior, the generalized Dirichlet distribution provides a more general covariance structure and valuable properties, such as the ability to capture relationships between latent topics, thereby enhancing the cross-collection topic model. This new GD-based model uses the graphics processing unit (GPU) to perform parallel inference on a single machine, providing a scalable and efficient training method for massive data. The GPU-based ccLGDA thus combines a complete generative process with a robust, computationally efficient inference scheme to compare multiple data collections and discover interpretable topics. Its performance in comparative text mining and document classification demonstrates its merits.
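For intuition on why the GD prior stays tractable for inference, a GD random vector can be built from independent Beta draws (the Connor-Mosimann stick-breaking construction), and the independence of those draws is what makes sampling easy to vectorize or parallelize across documents. Below is a minimal NumPy sketch of that construction; the function name and parameters are illustrative, not the thesis's implementation:

import numpy as np

def sample_generalized_dirichlet(alpha, beta, rng=None):
    """Draw topic proportions from GD(alpha, beta) via the
    Connor-Mosimann stick-breaking construction: each break is an
    independent Beta draw, so many draws vectorize easily."""
    rng = np.random.default_rng() if rng is None else rng
    K = len(alpha) + 1              # K topics need K-1 Beta draws
    nu = rng.beta(alpha, beta)      # independent Beta(alpha_k, beta_k) breaks
    theta = np.empty(K)
    remaining = 1.0
    for k in range(K - 1):
        theta[k] = nu[k] * remaining
        remaining *= 1.0 - nu[k]
    theta[-1] = remaining           # leftover mass goes to the last topic
    return theta

# Example: five topics from four independent Beta(1, 1) breaks.
theta = sample_generalized_dirichlet(alpha=np.ones(4), beta=np.ones(4))
print(theta, theta.sum())           # proportions sum to 1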

Furthermore, the restriction of the Dirichlet prior and significant privacy risks have hampered the performance and utility of cross-collection topic models. In particular, training such models may leak sensitive information from the training dataset. To address these two issues, we propose another novel model, cross-collection latent Beta-Liouville allocation (ccLBLA), which employs a more powerful prior, the Beta-Liouville distribution, whose general covariance structure enables better topic correlation analysis with fewer parameters than the GD distribution. To provide privacy protection for the ccLBLA model, we leverage the inherent differential privacy guarantee of the Collapsed Gibbs Sampling (CGS) inference scheme and propose a centralized privacy-preserving algorithm for ccLBLA (HDP-ccLBLA) that prevents inferring data from intermediate statistics during CGS training without sacrificing utility. More importantly, our technique is the first to apply a cross-collection topic model to image classification and to investigate its capabilities in that setting. Experimental results on comparative text mining and image classification demonstrate the merits of the proposed approach.
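For intuition on the two ingredients: a Beta-Liouville draw can be composed from one Beta draw and one Dirichlet draw (so it needs fewer parameters than a GD prior of the same dimension), and a common way to protect intermediate statistics in central differential privacy is to perturb released count matrices with calibrated noise. The sketch below illustrates both under those assumptions; the function names are hypothetical and the Laplace step is a generic mechanism, not the thesis's HDP-ccLBLA algorithm:

import numpy as np

def sample_beta_liouville(alpha_vec, alpha, beta, rng=None):
    """Draw topic proportions from a Beta-Liouville prior: one Beta
    draw splits the total mass, one Dirichlet draw shares the first
    part among the leading topics."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, beta)            # mass given to the first D topics
    u = rng.dirichlet(alpha_vec)           # how that mass is shared among them
    return np.append(lam * u, 1.0 - lam)   # last topic takes the remainder

def privatize_counts(word_topic_counts, epsilon, rng=None):
    """Illustrative central-DP step (hypothetical): add Laplace noise
    to intermediate CGS count statistics before release, assuming
    sensitivity 1, then clip at zero to keep counts valid."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.laplace(0.0, 1.0 / epsilon, size=word_topic_counts.shape)
    return np.clip(word_topic_counts + noise, 0.0, None)

# Example: five topics parameterized by 4 + 2 scalars.
theta = sample_beta_liouville(np.ones(4), alpha=2.0, beta=3.0)
print(theta, theta.sum())                  # a valid point on the simplex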

Divisions: Concordia University > Gina Cody School of Engineering and Computer Science > Concordia Institute for Information Systems Engineering
Item Type: Thesis (Masters)
Authors: Luo, Zhiwen
Institution: Concordia University
Degree Name: M.A.Sc.
Program: Information Systems Security
Date: 7 July 2022
Thesis Supervisor(s): Bouguila, Nizar
ID Code: 990678
Deposited By: Zhiwen Luo
Deposited On: 27 Oct 2022 14:20
Last Modified: 27 Oct 2022 14:20
All items in Spectrum are protected by copyright, with all rights reserved. The use of items is governed by Spectrum's terms of access.
