Chapados Muermans, Thomas (2022) Investigating the Use of Transformer Based Embeddings for Multilingual Discourse Connective Identification. Masters thesis, Concordia University.
Text (application/pdf), 2MB
Chapados-Muermans_MCompSci_S2022.pdf - Accepted Version
Abstract
In this thesis, we report on our experiments toward multilingual discourse connective (DC) identification and show that language-specific BERT models appear to be sufficient even with little task-specific training data, and do not require any additional handcrafted features to achieve strong results. However, some languages are under-resourced and lack large annotated discourse connective corpora. To address this, we developed a methodology to induce large synthetic discourse-annotated corpora using a parallel word-aligned corpus. We evaluated our models in three languages (English, Turkish, and Mandarin Chinese) and applied our induction methodology to English-Turkish and English-Chinese. All our models were evaluated in the context of the DISRPT 2021 Task 2 shared task. Results show that the F-measures achieved by our approach (93.12%, 94.42%, and 87.47% for English, Turkish, and Chinese, respectively) are at or near the state of the art for the three languages, without requiring any handcrafted features.
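The idea of inducing annotations through a parallel word-aligned corpus can be illustrated with a minimal sketch. This is not the thesis's actual pipeline, whose details are not given in the abstract. It only assumes that source tokens carry binary DC labels and that word alignments are available as (source index, target index) pairs. The function name `project_connectives`, the toy sentence pair, and the alignment are all hypothetical and invented for illustration.

```python
from typing import List, Set, Tuple


def project_connectives(
    src_dc_labels: List[int],
    tgt_length: int,
    alignment: Set[Tuple[int, int]],
) -> List[int]:
    """Project binary DC labels from source tokens onto target tokens.

    A target token is labelled 1 if it is aligned to at least one
    source token labelled 1; every other target token is labelled 0.
    """
    tgt_labels = [0] * tgt_length
    for src_idx, tgt_idx in alignment:
        if src_dc_labels[src_idx] == 1:
            tgt_labels[tgt_idx] = 1
    return tgt_labels


# Toy English-Turkish pair (invented example, not from the thesis data).
src_tokens = ["I", "stayed", "home", "because", "it", "rained"]
src_labels = [0, 0, 0, 1, 0, 0]  # "because" is the connective
tgt_tokens = ["Yağmur", "yağdığı", "için", "evde", "kaldım"]

# Hypothetical word alignment as (source index, target index) pairs.
alignment = {(0, 4), (1, 4), (2, 3), (3, 2), (4, 0), (5, 1)}

projected = project_connectives(src_labels, len(tgt_tokens), alignment)
print(list(zip(tgt_tokens, projected)))
```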
| Divisions: | Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering |
|---|---|
| Item Type: | Thesis (Masters) |
| Authors: | Chapados Muermans, Thomas |
| Institution: | Concordia University |
| Degree Name: | M. Comp. Sc. |
| Program: | Computer Science |
| Date: | 26 May 2022 |
| Thesis Supervisor(s): | Kosseim, Leila |
| ID Code: | 990633 |
| Deposited By: | THOMAS CHAPADOS MUERMANS |
| Deposited On: | 27 Oct 2022 14:24 |
| Last Modified: | 27 Oct 2022 14:24 |