
Vector Representation of Documents using Word Clusters


Bansal, Sunanda ORCID: https://orcid.org/0000-0002-4367-6778 (2021) Vector Representation of Documents using Word Clusters. Masters thesis, Concordia University.

Text (application/pdf): Bansal_MCompSc_F2021.pdf - Accepted Version (1MB)

Abstract

To process textual data with statistical methods such as Machine Learning (ML), the data often needs to be represented as a vector. With the dawn of the internet, the amount of textual data has exploded, and, partly owing to its size, most of this data is unlabeled. Therefore, sorting and analyzing text documents often requires representing them in an unsupervised way, i.e., with no prior knowledge of the expected output or labels. Most existing unsupervised methodologies do not factor in the similarity between words, and those that do leave room for improvement. This thesis discusses Word Cluster based Document Embedding (WcDe), in which documents are represented in terms of clusters of similar words, and compares its performance in representing documents at two levels of topical similarity - general and specific. The thesis shows that WcDe outperforms existing unsupervised representation methodologies at both levels of topical similarity. Furthermore, it analyzes variations of WcDe with respect to its components and identifies the combination of components that consistently performs well across both topical levels. Finally, it analyzes the document vectors generated by WcDe on two fronts: whether they capture the similarity of documents within a class, and whether they capture the dissimilarity of documents belonging to different classes. The analysis shows that Word Cluster based Document Embedding encodes both aspects of document representation well at both topical levels.
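The core idea described above - representing a document over clusters of similar words rather than over individual words - can be sketched as follows. This is a minimal illustrative sketch, not the thesis's actual pipeline: the toy two-dimensional word embeddings and the hand-rolled k-means are assumptions standing in for pretrained embeddings (e.g., word2vec or GloVe) and a production clustering library.

```python
import numpy as np

# Hypothetical toy word embeddings; in practice these would come from a
# pretrained model such as word2vec or GloVe.
word_vecs = {
    "cat": np.array([1.0, 0.1]),
    "dog": np.array([0.9, 0.2]),
    "car": np.array([0.1, 1.0]),
    "bus": np.array([0.2, 0.9]),
}

def kmeans(X, k, iters=10, seed=0):
    """Minimal k-means over word vectors; returns cluster centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each word vector to its nearest centroid.
        labels = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids

words = list(word_vecs)
X = np.stack([word_vecs[w] for w in words])
centroids = kmeans(X, k=2)

# Map each vocabulary word to its nearest cluster.
cluster_of = {w: int(np.argmin(((v - centroids) ** 2).sum(-1)))
              for w, v in word_vecs.items()}

def embed(doc_tokens, k=2):
    """Document vector = normalized histogram of word-cluster counts."""
    vec = np.zeros(k)
    for tok in doc_tokens:
        if tok in cluster_of:
            vec[cluster_of[tok]] += 1
    total = vec.sum()
    return vec / total if total else vec

doc_animals = embed(["cat", "dog", "dog"])
doc_vehicles = embed(["car", "bus"])
```

With well-separated word groups, documents about the same topic land on the same cluster dimensions (`doc_animals` and `doc_vehicles` end up orthogonal here), which illustrates how a cluster-level representation can capture within-class similarity and between-class dissimilarity with far fewer dimensions than a bag-of-words vector.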

Divisions: Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering
Item Type: Thesis (Masters)
Authors: Bansal, Sunanda
Institution: Concordia University
Degree Name: M. Comp. Sc.
Program: Computer Science
Date: 9 August 2021
Thesis Supervisor(s): Bergler, Sabine
Keywords: document, vector, representation, word, embedding, word cluster, unsupervised
ID Code: 988721
Deposited By: Sunanda Bansal
Deposited On: 29 Nov 2021 16:25
Last Modified: 29 Nov 2021 16:25
