
Content-based genre classification of large texts



Shahin, Amr (2019) Content-based genre classification of large texts. Masters thesis, Concordia University.

Text (application/pdf)
Shahin_MSc_S2019.pdf.pdf - Accepted Version
Available under License Spectrum Terms of Access.


The advent of Natural Language Processing (NLP) and deep learning allows us to
achieve tasks that sounded impossible about 10 years ago. One of those tasks is genre
classification for large bodies of text. Movies, books, novels, and various other texts,
more often than not, belong to one or more genres. The purpose of this research is
to classify those texts into their genres while also calculating the weighted presence of
each genre in those texts. Movies in particular are classified into genres
mostly for marketing purposes, with no indication of which genre is the most prominent.
In this thesis, we explore the possibility of using deep neural networks and NLP to
classify movies using the contents of the movie script. We follow the philosophy that
scenes make movies and generate the final result based on the classification of each
individual scene. The results were obtained by training Convolutional Neural Networks
(ConvNet or CNN) and Hierarchical Attention Networks (HAN) and comparing their
performance to the de facto architectures for NLP, namely Recurrent Neural Networks
(RNN) and Attention Models.
The results we obtained on the validation dataset are comparable to those obtained by
similar research, done mostly on sentiment analysis or rating prediction. The accuracy
is about 85%, which is an acceptable measure in the literature. We dedicate a part
of our conclusion to discussing how our models would perform on a larger dataset and
what steps could be taken to increase the accuracy.
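
The abstract describes classifying each scene individually and deriving a weighted genre presence for the whole movie, but it does not say how the scene-level results are combined. The following is a minimal sketch of one possible aggregation, assuming each scene receives a single genre label and that a genre's weight is simply the fraction of scenes assigned to it. The function name `genre_weights` and the sample labels are hypothetical and are not taken from the thesis.

```python
from collections import Counter


def genre_weights(scene_genres):
    """Return the fraction of scenes assigned to each genre.

    Assumption (not from the thesis): every scene carries exactly one
    predicted genre label, and all scenes count equally.
    """
    if not scene_genres:
        return {}
    counts = Counter(scene_genres)
    total = len(scene_genres)
    return {genre: count / total for genre, count in counts.items()}


# Hypothetical per-scene predictions for a single movie script.
scene_predictions = ["action", "comedy", "action", "drama", "action"]
weights = genre_weights(scene_predictions)

# The genre with the largest share of scenes under this assumption.
dominant_genre = max(weights, key=weights.get)
```

Under this assumption, the movie-level output is a distribution over genres rather than a single label, which matches the stated goal of indicating how strongly each genre is present. A real system might instead weight scenes by length or average the classifier's per-scene probabilities.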

Item Type: Thesis (Masters)
Authors: Shahin, Amr
Institution: Concordia University
Degree Name: M. Comp. Sc.
Program: Computer Science
Date: May 2019
Thesis Supervisor(s): Krzyzak, Adam
ID Code: 985410
Deposited By: Amr Shahin
Deposited On: 06 Feb 2020 02:47
Last Modified: 17 Feb 2021 00:00
All items in Spectrum are protected by copyright, with all rights reserved. The use of items is governed by Spectrum's terms of access.
