Su, Weijia (2013) A Hierarchical Statistical Framework for the Extraction of Semantically Related Words in Textual Documents. Masters thesis, Concordia University.
Text (application/pdf): thesis_pdfa.pdf (668kB), Accepted Version. Available under License Spectrum Terms of Access.
Abstract
A large number of documents are now available in electronic format on the Internet, including daily news, blog articles, messages posted online, and even books and magazines. The information that can be extracted from these documents is of particular importance to several agencies and companies (e.g. security agencies, insurance companies, and advertising and marketing companies). In the case of security, for instance, recent studies have shown that cyber criminals generally exchange their experience and knowledge via media such as forums and blogs. These exchanged data, if well extracted and modeled, can provide significant clues to agencies operating in the security field. However, managing and processing the huge quantity of multimodal (i.e. image, video, text, audio) information present on the Web is a challenging task. In this thesis, we focus on textual data, for which many statistical language modeling frameworks have been developed to facilitate the management of digitized texts. Many of these approaches have achieved strong performance in various applications. However, most of them have focused on modeling documents individually, while in the real world most documents are related, organized and archived into categories according to their themes. The main goal of this thesis is to propose a hierarchical statistical model for analyzing document collections characterized by a hierarchical structure, in order to find hidden information and detect potential threats in them. The proposed model is part of a large cyber security forensics system that we are designing to discover and capture potential security threats by retrieving and analyzing data gathered from the Web. Our approach models each node in a given textual collection using advanced statistical techniques, which allows the semantic information hidden inside it to be captured.
In particular, a log-bilinear model is adopted to describe words in a vector space in such a way that their correlations can be discovered and derived from their representations at each level of the hierarchical structure. Experimental results on real-world data illustrate the merits of our model and its efficiency in extracting hidden semantic information from document collections.
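The abstract does not spell out the model, so the following is only a minimal sketch of a generic log-bilinear word model, not the thesis's hierarchical formulation. It assumes one scalar weight per context position (the general model uses a matrix per position), and the vocabulary, vectors, and function names are all hypothetical, hand-picked for illustration rather than learned. It shows the two ideas the abstract names: words live in a vector space, and related words can be read off from their representations.

```python
import math

# Hypothetical 2-d word vectors, hand-picked for illustration
# (not learned, and not taken from the thesis).
EMBEDDINGS = {
    "malware": [1.0, 0.1],
    "virus": [0.9, 0.2],
    "garden": [-0.8, 0.9],
}
BIASES = {word: 0.0 for word in EMBEDDINGS}


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def predict_vector(context_words, position_weights):
    """Linear combination of the context word vectors.

    Assumption: one scalar weight per context position; the general
    log-bilinear model uses a full matrix per position.
    """
    dim = len(next(iter(EMBEDDINGS.values())))
    predicted = [0.0] * dim
    for word, weight in zip(context_words, position_weights):
        for d, value in enumerate(EMBEDDINGS[word]):
            predicted[d] += weight * value
    return predicted


def next_word_distribution(context_words, position_weights):
    """Softmax over (predicted vector . word vector + bias) for every word."""
    predicted = predict_vector(context_words, position_weights)
    scores = {w: dot(predicted, vec) + BIASES[w] for w, vec in EMBEDDINGS.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    unnormalized = {w: math.exp(s - top) for w, s in scores.items()}
    total = sum(unnormalized.values())
    return {w: u / total for w, u in unnormalized.items()}


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def related_words(word):
    """Rank the other vocabulary words by cosine similarity to `word`."""
    others = [w for w in EMBEDDINGS if w != word]
    return sorted(
        others,
        key=lambda w: cosine(EMBEDDINGS[word], EMBEDDINGS[w]),
        reverse=True,
    )
```

With these toy vectors, `related_words("malware")` ranks "virus" ahead of "garden". In a hierarchical setting, one such set of representations could be kept per node of the collection, though how the thesis ties the levels together is not described in the abstract.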
Divisions: | Concordia University > Gina Cody School of Engineering and Computer Science > Concordia Institute for Information Systems Engineering |
---|---|
Item Type: | Thesis (Masters) |
Authors: | Su, Weijia |
Institution: | Concordia University |
Degree Name: | M.A.Sc. |
Program: | Quality Systems Engineering |
Date: | 25 June 2013 |
Thesis Supervisor(s): | Bouguila, Nizar and Ziou, Djemel |
ID Code: | 977410 |
Deposited By: | WEIJIA SU |
Deposited On: | 16 Jun 2017 15:49 |
Last Modified: | 18 Jan 2018 17:44 |