Davoodi, Elnaz (2017) Computational Discourse Analysis Across Complexity Levels. PhD thesis, Concordia University.
Text (application/pdf), 610kB: Davoodi_PhD_2017.pdf - Accepted Version
Abstract
The focus of this thesis is to study computationally the relation between discourse properties and textual complexity. Specifically, we explored three research questions.
The first research question asks to what degree discourse-level properties can be used to predict the complexity level of a text. To answer it, we considered three types of discourse-level properties: (1) the realization of discourse relations, and the representation of discourse relations in terms of (2) the choice of discourse relation and (3) the choice of discourse marker. Using datasets from standard corpora in the fields of discourse analysis and text simplification, we developed a supervised machine learning model for pairwise text complexity assessment and compared these properties with more traditional linguistic features. Our results show that using only discourse features performed statistically as well as using traditional linguistic features. Thus, we conclude that there is a strong correlation between discourse properties and complexity level.
The second question that we explored is how exactly the complexity level of a text influences its discourse-level linguistic choices. To address this question, we conducted a corpus analysis of the Simple English Wikipedia, the largest annotated corpus based on complexity level. Our analysis used the 16 discourse relations defined in the DLTAG framework and focused on explicit relations. Our results show that the distribution of discourse relations is not influenced by a text’s complexity level, but how these relations are signaled is.
Finally, given the results of our corpus analysis, our third research question investigates whether we can leverage these differences to mine parallel corpora across complexity levels and automatically discover alternative lexicalizations (AltLexes) of discourse markers. This work led to the automatic identification of 91 new AltLexes in two corpora: the Simple English Wikipedia and the Newsela corpora.
Overall, this thesis demonstrates that a text’s complexity level and discourse-level properties are indeed correlated. Discourse properties play an important role in the assessment of a text’s complexity level and should be taken into account in the complexity level assessment problem. In addition, we observed that the way that explicit discourse relations are signaled is influenced by textual complexity. Lastly, our thesis shows that the automatic identification of alternative lexicalizations of discourse markers can benefit from large-scale parallel corpora across complexity levels.
Divisions: | Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering |
---|---|
Item Type: | Thesis (PhD) |
Authors: | Davoodi, Elnaz |
Institution: | Concordia University |
Degree Name: | Ph. D. |
Program: | Computer Science |
Date: | 31 August 2017 |
Thesis Supervisor(s): | Kosseim, Leila |
ID Code: | 982967 |
Deposited By: | ELNAZ DAVOODI |
Deposited On: | 08 Nov 2017 21:00 |
Last Modified: | 18 Jan 2018 17:56 |