
Automatic Keyword Tagging With Machine Learning Approach



Shen, Xingyu (2021) Automatic Keyword Tagging With Machine Learning Approach. Masters thesis, Concordia University.

Text (application/pdf)
Shen_MASc_S2022.pdf - Accepted Version


With the explosive growth of information in the Internet age, keywords have become the main tool users rely on to find content of interest among large volumes of information. Keyword tagging can be divided into in-text keyword extraction and out-of-text keyword assignment. Keyword extraction is an important area of natural language processing (NLP), but the technology is still immature. Traditional keyword extraction methods struggle to satisfy three commonly desired characteristics simultaneously, namely understandability, relevance and good coverage, so even in the Web 2.0 era many web pages are still tagged manually.
In this thesis, we propose a novel unsupervised keyword extraction method that integrates word embeddings (GloVe and fastText) with clustering (Affinity Propagation, Mean Shift and K-means). We cluster the terms in a document by semantic relevance and extract the noun phrase nearest to the center of each cluster as a keyword. This method ensures that the extracted keywords satisfy all three of the above characteristics at the same time. Our computer simulation results on the Hulth-2003, Krapivin-2009 and Nguyen-2007 datasets show that the proposed method outperforms all other existing methods in terms of common evaluation metrics such as Precision, Recall and F1-Score.
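The cluster-then-pick-nearest idea described above can be illustrated with a minimal sketch. It is not the thesis implementation: it uses only K-means (one of the three clustering algorithms named), and the candidate phrases, their 2-D vectors, and the function names `kmeans` and `extract_keywords` are hypothetical stand-ins for real noun phrases and their GloVe or fastText embeddings.

```python
import math

# Toy 2-D vectors standing in for phrase embeddings (hypothetical values;
# a real system would look up GloVe or fastText vectors for each candidate).
CANDIDATES = {
    "neural network": (0.9, 0.1),
    "deep learning": (1.0, 0.2),
    "training data": (0.8, 0.0),
    "stock market": (0.1, 0.9),
    "share price": (0.0, 1.0),
    "investor": (0.2, 0.8),
}


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def kmeans(points, k, iters=20):
    """Plain K-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        # Next seed: the point farthest from all centers chosen so far.
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Keep the old center if a cluster happens to be empty.
        new_centers = [mean(g) if g else centers[i] for i, g in enumerate(groups)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def extract_keywords(candidates, k):
    """Return, for each cluster center, the candidate phrase closest to it."""
    centers = kmeans(list(candidates.values()), k)
    return [
        min(candidates, key=lambda name: math.dist(candidates[name], c))
        for c in centers
    ]


print(extract_keywords(CANDIDATES, 2))  # ['neural network', 'stock market']
```

With two clusters, the sketch returns one representative phrase per topic, which mirrors the coverage goal stated in the abstract: each cluster contributes the phrase nearest to its center.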
This thesis also proposes a CNN-BiLSTM model for keyword assignment that uses word embeddings and an attention mechanism. The model overcomes the limitation of a standalone CNN, which ignores the semantic and syntactic information of the input context, and effectively avoids the vanishing gradient problem of traditional RNNs. Moreover, the attention mechanism highlights important information and limits the influence of irrelevant information on sentiment analysis and text classification. Experimental results on three datasets, namely 20 Newsgroups, IMDB and SemEval-2018 Task 1, show that the proposed keyword assignment method outperforms previous methods in terms of common evaluation metrics such as F1-Score, Accuracy and AUC, indicating the wide applicability of our method to various datasets.
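How an attention mechanism can "highlight important information" is easiest to see in a small example. The sketch below shows simple dot-product attention pooling over a sequence of hidden states. The abstract does not specify the attention formulation used in the thesis, so the scoring function, the `query` vector, and the names `softmax` and `attention_pool` are assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query):
    """Weight each time step by its dot product with `query`, then average.

    hidden_states: list of equal-length vectors, one per time step
                   (in a CNN-BiLSTM these would be the BiLSTM outputs).
    query:         a vector of the same length (assumed here; in a trained
                   model it would be a learned parameter).
    """
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)
    ]
    return weights, context


# Three time steps; the third aligns best with the query and so receives
# the largest attention weight.
states = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
weights, context = attention_pool(states, query=[1.0, 0.0])
print(weights.index(max(weights)))  # 2
```

The resulting context vector is a weighted average dominated by the time steps with the highest scores, so positions that match the query poorly contribute little to the final classification input.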

Divisions: Concordia University > Gina Cody School of Engineering and Computer Science > Electrical and Computer Engineering
Item Type: Thesis (Masters)
Authors: Shen, Xingyu
Institution: Concordia University
Degree Name: M.A. Sc.
Program: Electrical and Computer Engineering
Date: 21 December 2021
Thesis Supervisor(s): Zhu, Wei-Ping and Moazzen, Iman Moazzen
ID Code: 990187
Deposited By: Xingyu Shen
Deposited On: 16 Jun 2022 15:11
Last Modified: 16 Jun 2022 15:11
All items in Spectrum are protected by copyright, with all rights reserved. The use of items is governed by Spectrum's terms of access.
