Shahin, Amr (2019) Content-based genre classification of large texts. Masters thesis, Concordia University.
Text (application/pdf), 5MB: Shahin_MSc_S2019.pdf.pdf - Accepted Version. Available under License Spectrum Terms of Access.
Abstract
The advent of Natural Language Processing (NLP) and deep learning allows us to
achieve tasks that sounded impossible about 10 years ago. One of those tasks is genre
classification for large text bodies. Movies, books, novels, and various other texts
more often than not belong to one or more genres. The purpose of this research is
to classify those texts into their genres while also calculating the weighted presence of
each genre in the text. Movies in particular are classified into genres
mostly for marketing purposes, with no indication of which genre is the most
dominant.
In this thesis, we explore the possibility of using deep neural networks and NLP to
classify movies using the contents of the movie script. We follow the philosophy that
scenes make movies and generate the final result based on the classification of each
individual scene. The results were obtained by training Convolutional Neural Networks
(ConvNet or CNN) and Hierarchical Attention Networks (HAN) and comparing their
performance to the de facto architectures for NLP, namely Recurrent Neural Networks
(RNN) and Attention Models.
The results we obtained on the validation dataset are comparable to those obtained by
similar research, done mostly on sentiment analysis or rating prediction. The accuracy
is about 85%, which is considered acceptable in the literature. We dedicate part
of our conclusion to discussing how our models would perform on a larger dataset and
what steps could be taken to increase the accuracy.
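The scene-by-scene approach described in the abstract can be illustrated with a small sketch. The abstract does not say how scene-level predictions are combined, so the code below assumes the simplest option: sum the per-scene genre probabilities and normalize them into movie-level weights. The function name `aggregate_scene_predictions`, the genre labels, and the probability values are all hypothetical and are not taken from the thesis.

```python
from collections import defaultdict


def aggregate_scene_predictions(scene_probs):
    """Combine per-scene genre probabilities into movie-level genre weights.

    scene_probs: list of dicts, one per scene, mapping genre -> probability.
    Returns a dict mapping genre -> weight, where the weights sum to 1.

    Assumption: plain summation followed by normalization. The thesis may
    use a different aggregation rule.
    """
    if not scene_probs:
        raise ValueError("at least one scene is required")

    totals = defaultdict(float)
    for probs in scene_probs:
        for genre, p in probs.items():
            totals[genre] += p

    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("scene probabilities must not all be zero")

    return {genre: total / grand_total for genre, total in totals.items()}


# Made-up classifier outputs for three scenes of one script.
scenes = [
    {"action": 0.75, "comedy": 0.25, "drama": 0.0},
    {"action": 0.5, "comedy": 0.0, "drama": 0.5},
    {"action": 0.25, "comedy": 0.25, "drama": 0.5},
]

weights = aggregate_scene_predictions(scenes)
dominant = max(weights, key=weights.get)
```

Here `weights` plays the role of the weighted presence of each genre, and `dominant` is the genre with the largest weight. Weighting scenes by length or by classifier confidence would be a natural variation under the same assumptions.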
Item Type: | Thesis (Masters) |
---|---|
Authors: | Shahin, Amr |
Institution: | Concordia University |
Degree Name: | M. Comp. Sc. |
Program: | Computer Science |
Date: | May 2019 |
Thesis Supervisor(s): | Krzyzak, Adam |
ID Code: | 985410 |
Deposited By: | Amr Shahin |
Deposited On: | 06 Feb 2020 02:47 |
Last Modified: | 17 Feb 2021 00:00 |