Chapados Muermans, Thomas (2022) Investigating the Use of Transformer Based Embeddings for Multilingual Discourse Connective Identification. Masters thesis, Concordia University.
Text (application/pdf), 2MB: Chapados-Muermans_MCompSci_S2022.pdf - Accepted Version
Abstract
In this thesis, we report on our experiments on multilingual discourse connective (DC) identification and show that language-specific BERT models appear to be sufficient even with little task-specific training data, and do not require any additional handcrafted features to achieve strong results. However, some languages are under-resourced and lack large annotated discourse connective corpora. To address this, we developed a methodology to induce large synthetic discourse-annotated corpora using a parallel word-aligned corpus. We evaluated our models on three languages (English, Turkish, and Mandarin Chinese) and applied our induction methodology to English-Turkish and English-Chinese. All our models were evaluated in the context of the DISRPT 2021 Task 2 shared task. Results show that the F-measures achieved by our approach (93.12%, 94.42%, and 87.47% for English, Turkish, and Chinese, respectively) are near or at the state of the art for the three languages, while the approach remains simple and requires no handcrafted features.
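The abstract does not spell out how annotations are induced from the parallel word-aligned corpus. One common way to do this kind of induction is annotation projection: labels on the annotated (source) side are copied to whichever target tokens the word alignment links them to. The following is a minimal sketch of that general idea, not the thesis's actual procedure; the function name `project_labels`, the `"DC"`/`"O"` label scheme, and the toy English-Turkish sentence pair and alignment are all hypothetical.

```python
from typing import List, Tuple


def project_labels(src_labels: List[str],
                   alignment: List[Tuple[int, int]],
                   tgt_len: int) -> List[str]:
    """Copy source-side DC labels onto aligned target tokens.

    alignment holds (source_index, target_index) pairs. Target tokens
    with no aligned DC token keep the default label "O".
    """
    tgt_labels = ["O"] * tgt_len
    for src_idx, tgt_idx in alignment:
        if src_labels[src_idx] != "O":
            tgt_labels[tgt_idx] = "DC"
    return tgt_labels


# Hypothetical toy pair: English "but" is aligned with Turkish "ama".
src_tokens = ["I", "came", "but", "he", "left"]
src_labels = ["O", "O", "DC", "O", "O"]
tgt_tokens = ["Geldim", "ama", "o", "gitti"]
alignment = [(0, 0), (1, 0), (2, 1), (3, 2), (4, 3)]

projected = project_labels(src_labels, alignment, len(tgt_tokens))
for token, label in zip(tgt_tokens, projected):
    print(token, label)
```

In this toy case only `ama` receives the `DC` label. A real pipeline would also need to handle alignment errors, unaligned connectives, and multi-word connectives, none of which this sketch attempts.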
Divisions: | Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering |
---|---|
Item Type: | Thesis (Masters) |
Authors: | Chapados Muermans, Thomas |
Institution: | Concordia University |
Degree Name: | M. Comp. Sc. |
Program: | Computer Science |
Date: | 26 May 2022 |
Thesis Supervisor(s): | Kosseim, Leila |
ID Code: | 990633 |
Deposited By: | THOMAS CHAPADOS MUERMANS |
Deposited On: | 27 Oct 2022 14:24 |
Last Modified: | 27 Oct 2022 14:24 |