Luo, Zhiwen (2022) Extensions to Cross-collection Topic Models with Parallel Inference and Differential Privacy using Flexible Priors. Masters thesis, Concordia University.
Text (application/pdf), 10MB: Luo_MA_F2022.pdf - Accepted Version. Available under License Spectrum Terms of Access.
Abstract
Cross-collection topic models extend single-collection topic models such as Latent Dirichlet Allocation (LDA) to multiple collections. The purpose of cross-collection topic modelling is to model document-topic representations and to reveal both the similarities between topics and the differences among groups. The limitations of the Dirichlet prior have impeded the performance of state-of-the-art cross-collection topic models, motivating the introduction of more flexible priors.
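The restrictive covariance structure referred to here can be seen directly: under a Dirichlet prior, any two proportion components are necessarily negatively correlated, whatever the concentration parameters. A minimal empirical check (illustrative only, not code from the thesis; the parameter values are arbitrary):

```python
import numpy as np

# Draw many samples from a Dirichlet with arbitrary concentration
# parameters and inspect the pairwise correlations of its components.
rng = np.random.default_rng(0)
samples = rng.dirichlet([2.0, 5.0, 3.0], size=100_000)
corr = np.corrcoef(samples, rowvar=False)

# Every off-diagonal correlation is negative: the Dirichlet cannot
# express positively correlated topic proportions.
off_diag = corr[~np.eye(3, dtype=bool)]
assert (off_diag < 0).all()
```

Flexible priors such as the generalized Dirichlet and Beta-Liouville distributions relax exactly this constraint.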
In this thesis, we first introduce a novel topic model, GPU-based cross-collection latent generalized Dirichlet allocation (ccLGDA), which explores the similarities and differences across multiple data collections by adopting the generalized Dirichlet (GD) distribution, overcoming the limitations of the Dirichlet prior in conventional topic models while improving computational efficiency. As a more flexible prior, the generalized Dirichlet distribution provides a more general covariance structure and valuable properties, such as capturing relationships between latent topics across collections, that enhance the cross-collection topic model. This new GD-based model uses the Graphics Processing Unit (GPU) to perform parallel inference on a single machine, providing a scalable and efficient training method for massive datasets. The GPU-based ccLGDA thus combines a thorough generative process with a robust inference process and powerful computational techniques to compare multiple data collections and discover interpretable topics. Its performance in comparative text mining and document classification demonstrates its merits.
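One property that makes the GD prior tractable is its standard construction from independent Beta variates combined by a stick-breaking product, which also yields a simple sampler. A minimal sketch of that general construction (the function name and signature are my own, not the thesis's implementation):

```python
import numpy as np

def sample_generalized_dirichlet(alpha, beta, rng=None):
    """Draw one sample from a generalized Dirichlet distribution.

    alpha and beta are length-K positive arrays; the sample has K+1
    components on the simplex. Each v_k ~ Beta(alpha_k, beta_k) is
    drawn independently, then theta_k = v_k * prod_{j<k} (1 - v_j),
    with the last component taking the remaining stick.
    """
    rng = np.random.default_rng() if rng is None else rng
    v = rng.beta(alpha, beta)                              # K independent Beta draws
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)))  # stick left before each break
    theta = np.append(v, 1.0) * remaining                  # telescopes to sum exactly 1
    return theta
```

Unlike the Dirichlet's single concentration vector, the two parameter vectors give the GD its richer covariance structure.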
Furthermore, the restriction of the Dirichlet prior and significant privacy risks have hampered the performance and utility of cross-collection topic models. In particular, training such models may leak sensitive information from the training dataset. To address these two issues, we propose another novel model, cross-collection latent Beta-Liouville allocation (ccLBLA), which employs a more powerful prior, the Beta-Liouville distribution, whose more general covariance structure enables better topic-correlation analysis with fewer parameters than the GD distribution. To provide privacy protection for the ccLBLA model, we leverage the inherent differential privacy guarantee of the Collapsed Gibbs Sampling (CGS) inference scheme and propose a centralized privacy-preserving algorithm for the ccLBLA model (HDP-ccLBLA) that prevents inferring data from intermediate statistics during CGS training without sacrificing utility. More crucially, our technique is the first to apply a cross-collection topic model to image classification and to investigate its capabilities in that setting. Experimental results on comparative text mining and image classification demonstrate the merits of the proposed approach.
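The intermediate statistics at risk during CGS training are count matrices (e.g. topic-word counts) updated each sweep. As a rough illustration of protecting such statistics before release, here is a generic Laplace-mechanism sketch; it is not the HDP-ccLBLA algorithm from the thesis, and the function and its parameters are hypothetical:

```python
import numpy as np

def privatize_counts(topic_word_counts, epsilon, rng=None):
    """Release topic-word counts with Laplace noise (generic sketch).

    Each count receives independent Laplace noise with scale 1/epsilon
    (appropriate when one token changes one count by at most 1), and
    negative results are clipped so the perturbed matrix remains a
    valid input for subsequent processing.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.laplace(scale=1.0 / epsilon, size=topic_word_counts.shape)
    return np.clip(topic_word_counts + noise, 0.0, None)
```

The thesis's approach instead exploits the randomness already present in CGS itself, which is what allows privacy without the utility loss that explicit noise addition like this would incur.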
Divisions: Concordia University > Gina Cody School of Engineering and Computer Science > Concordia Institute for Information Systems Engineering
Item Type: Thesis (Masters)
Authors: Luo, Zhiwen
Institution: Concordia University
Degree Name: M.A.Sc.
Program: Information Systems Security
Date: 7 July 2022
Thesis Supervisor(s): Bouguila, Nizar
ID Code: 990678
Deposited By: Zhiwen Luo
Deposited On: 27 Oct 2022 14:20
Last Modified: 27 Oct 2022 14:20