Classification of text documents and extraction of semantically related words using hierarchical Latent Dirichlet Allocation

Chatri, Imane (2015) Classification of text documents and extraction of semantically related words using hierarchical Latent Dirichlet Allocation. Masters thesis, Concordia University.

Text (application/pdf)
Chatri_MASc_S2015.pdf - Accepted Version

1MB

Abstract

The amount of available data has grown rapidly in recent years. Effectively managing large and growing collections of information is of utmost importance because these data are critical to many entities and companies (government, security, education, tourism, health, insurance, finance, etc.). In the field of security, cyber criminals and victims alike share their experiences via forums, social media and other online platforms, and these data can provide significant information to people working in the security field. This is why more and more computer scientists have turned to the study of data classification and topic models. However, processing and analyzing all these data is a difficult task.
In this thesis, we develop an efficient machine learning approach based on a hierarchical extension of the Latent Dirichlet Allocation model to classify textual documents and to extract semantically related words. A variational approach is developed to infer and learn the parameters of the hierarchical model used to represent and classify our data. The data considered in this thesis are textual, a domain for which many frameworks have been developed and are reviewed here. Our model is able to classify textual documents into distinct categories and to extract semantically related words from a collection of textual documents. We also show that the proposed model improves on the efficiency of previously proposed models. This work is part of a large cyber-crime forensics system whose goal is to analyze and discover all kinds of information and data, as well as the correlations between them, in order to help security agencies in their investigations and with the gathering of critical data.

Divisions:	Concordia University > Gina Cody School of Engineering and Computer Science > Concordia Institute for Information Systems Engineering
Item Type:	Thesis (Masters)
Authors:	Chatri, Imane
Institution:	Concordia University
Degree Name:	M.A. Sc.
Program:	Quality Systems Engineering
Date:	27 March 2015
Thesis Supervisor(s):	Bouguila, Nizar and Ziou, Djemel
ID Code:	979805
Deposited By:	IMANE CHATRI
Deposited On:	13 Jul 2015 14:07
Last Modified:	02 Apr 2019 20:05
