Reinforcement Learning for Mitigating Toxicity in Neural Dialogue Systems

Faal, Farshid (2022) Reinforcement Learning for Mitigating Toxicity in Neural Dialogue Systems. PhD thesis, Concordia University.

Text (application/pdf)
Faal_PhD_F2022.pdf - Accepted Version
Available under License Spectrum Terms of Access.

1MB

Abstract

Developing a machine that can hold an engaging conversation with a human is one of the main challenges in designing an open-domain dialogue system in the field of natural language processing.
With the advancement of deep learning techniques and the availability of large amounts of data on human-to-human conversational interaction, a fully data-driven and holistic approach is now considered for designing open-domain dialogue systems.
Dialogue generation models trained on large corpora of human-to-human interactions learn undesirable features and mimic behaviors present in the data, including toxic language and gender and racial biases.
Hence, as dialogue systems become more widespread and trusted, developing systems that account for possible safety concerns is vital.
In the first part of the thesis, we address the limitations of training an open-domain dialogue generation model with the log-likelihood method. We propose the Reinforce Transformer-decoder model, a novel approach for training Transformer-decoder-based conversational models that combines proximal policy optimization techniques from reinforcement learning with the Transformer-decoder architecture.
We specifically examine the use of our proposed model for multi-turn open-domain dialogue response generation on the Reddit dialogues data, a real-world human-to-human dataset. Experiments demonstrate that, according to diversity and relevance evaluation metrics, responses generated by our proposed neural dialogue response generation model are diverse and contain information specific to the source prompt.

In the second part of the thesis, we propose a new approach based on a domain adaptation language model and a multitask deep neural network to detect and identify toxic language in textual content.
We argue that the first step in managing toxic language risk is identification, but algorithmic approaches have demonstrated bias. Texts containing some demographic identity terms such as Muslim, Jewish, Asian, or Black are more likely to be labeled as toxic in existing toxic language detection datasets. In many machine learning models introduced for toxic language detection, non-toxic comments containing minority and marginalized community-specific identity terms were given unreasonably high toxicity scores. To address the challenge of bias in toxic language detection, we employ six toxic language detection and identification tasks to train the model to detect toxic content and mitigate unintended bias in model prediction.
We evaluate and compare our model with other state-of-the-art deep learning models using specific performance metrics to measure the model bias. In detailed experiments, we show that our approach can identify toxic language in textual content while being considerably more robust to model bias towards commonly attacked identity groups mentioned in the text. Moreover, the experimental results illustrate that jointly training the pretrained language model with a multitask objective can effectively mitigate the impact of unintended biases, and improves robustness to bias towards commonly attacked identity groups present in the datasets, without significantly hurting the model's generalizability.

In the third part of the thesis, we propose our approach to mitigate toxic language generation by neural generative language models and conversational AI systems.
Transformer-based language models can generate fluent text and efficiently adapt to various natural language generation tasks.
However, language models that are pretrained on large unlabeled web text corpora suffer from degenerating into toxic content and from social bias, hindering their safe use as a basis for fine-tuning dialogue response generation systems.
Various detoxification methods have been proposed to mitigate language model toxicity; however, these methods struggle to detoxify language models when conditioned on prompts that contain specific social identities related to gender, race, or religion.
In this study, we propose Reinforce-Detoxify, a reinforcement learning-based method for mitigating toxicity in language models.
Reinforce-Detoxify is formulated as an autoregressive LM and uses a multilayer transformer-decoder as the model architecture.
We also address the effect of detoxification methods on how LMs generate language about social identities. We propose a reward model based on multitask learning that mitigates unintended bias related to various social identities in toxicity prediction, and we employ this multitask deep neural network model as the reward function for fine-tuning the generative model.
Furthermore, to prevent the unfavorable effect of detoxification on language model fluency, we penalize the Kullback-Leibler divergence between the learned policy and the original LM that we used to initialize the policy.
Empirical results demonstrate that utilizing reinforcement learning for fine-tuning the language models to maximize the reward can mitigate toxic language generation and outperform the current detoxification methods in the literature. Furthermore, we show that utilizing a reward model trained to reduce unintended bias towards various social identities successfully enables the language models to mitigate toxicity when conditioned on prompts related to these social identities.
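
The Kullback-Leibler penalty mentioned in the abstract can be illustrated with a small sketch. The abstract does not give the exact formula, so the code below assumes a form commonly used in reinforcement-learning fine-tuning of language models: the reward-model score minus a weighted, single-sample estimate of the divergence between the policy and the original LM. The function name, its arguments, and the value of `beta` are illustrative assumptions, not taken from the thesis.

```python
def kl_penalized_reward(reward, logp_policy, logp_reference, beta=0.1):
    """Reward-model score minus a KL penalty estimated from one sampled text.

    reward: scalar score from a reward model (hypothetical here), higher is
        better, e.g. higher for less toxic continuations.
    logp_policy: per-token log-probabilities of the sampled continuation
        under the policy being fine-tuned.
    logp_reference: per-token log-probabilities of the same tokens under
        the frozen original LM that initialized the policy.
    beta: penalty weight (assumed value, not from the thesis).
    """
    if len(logp_policy) != len(logp_reference):
        raise ValueError("log-probability lists must have equal length")
    # The summed log-ratio over the sampled tokens is a single-sample
    # estimate of KL(policy || reference) for this continuation.
    kl_estimate = sum(p - q for p, q in zip(logp_policy, logp_reference))
    return reward - beta * kl_estimate
```

When the policy assigns the same probabilities as the original LM, the penalty is zero and the reward passes through unchanged. As the policy drifts from the original LM on the sampled tokens, the estimate grows and the effective reward shrinks, which is how such a penalty discourages loss of fluency.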

Divisions:	Concordia University > Gina Cody School of Engineering and Computer Science > Concordia Institute for Information Systems Engineering
Item Type:	Thesis (PhD)
Authors:	Faal, Farshid
Institution:	Concordia University
Degree Name:	Ph. D.
Program:	Information and Systems Engineering
Date:	14 July 2022
Thesis Supervisor(s):	Schmitt, Ketra
ID Code:	991234
Deposited By:	FARSHID FAAL
Deposited On:	12 Oct 2022 20:34
Last Modified:	27 Oct 2022 14:32
