
Novel Deep Learning Approaches for Single-Channel Speech Enhancement



Wang, Kai (2022) Novel Deep Learning Approaches for Single-Channel Speech Enhancement. Masters thesis, Concordia University.

Wang_MASc_F2022.pdf - Accepted Version
Available under License Spectrum Terms of Access.


Acquiring a speech signal in a real-world environment is almost always accompanied by various ambient noises, which degrade the intelligibility and quality of the speech. Speech enhancement, which aims to estimate the clean speech from the noisy mixture by removing the background noise, is therefore necessary for delivering high-quality speech to listeners. Recently, deep learning approaches have proven a powerful tool and have greatly advanced speech enhancement, typically by training a neural network with supervised learning to learn the mapping from noisy speech to clean speech. However, commonly used neural networks cannot efficiently leverage the contextual information of the speech signal and involve a large footprint. For example, convolutional neural networks need to be deep enough to capture a sufficient receptive field due to their intrinsic locality. Moreover, recurrent neural networks cannot learn the long-term dependencies of long speech sequences and are computationally expensive because of their sequential processing.

To learn the contextual information of long-range speech sequences, the first contribution of this thesis proposes two novel attention-based transformer neural networks for single-channel speech enhancement: the two-stage transformer and the cross-parallel transformer, termed TSTNN and CPTNN, respectively. The proposed TSTNN adopts multiple pairs of local and global transformers in a cascaded connection to successively extract both the local information of individual frames and the global information across frames, generating a contextual feature representation. To overcome the information leakage caused by shifting from one path to another in the cascaded structure, the proposed CPTNN employs cross-parallel transformer blocks that extract local and global information in parallel, whose outputs are then adaptively fused by an inner cross-attention mechanism to generate the contextual information. Both proposed architectures place the transformer blocks between a convolutional encoder and decoder, where the outputs of the transformer blocks are fed into a masking module to create a mask that filters the encoder outputs, which are then transformed into enhanced speech by the decoder. Extensive experiments indicate that the proposed TSTNN and CPTNN achieve competitive or even superior performance compared to existing state-of-the-art methods in all evaluation metrics while having the fewest trainable parameters.
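The mask-and-filter pipeline shared by both architectures can be sketched in NumPy. The dimensions, the random stand-ins for the learned encoder and transformer outputs, and the sigmoid masking function are illustrative assumptions, not the thesis implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy dimensions: T time frames, F feature channels (hypothetical sizes).
T, F = 100, 64

# Stand-ins for learned components: the encoder output for a noisy
# utterance and the transformer-block output that drives the mask.
encoder_out = rng.standard_normal((T, F))
transformer_out = rng.standard_normal((T, F))

# Masking module: squash the transformer features into a (0, 1) mask,
# then filter the encoder output element-wise before decoding.
mask = sigmoid(transformer_out)
filtered = mask * encoder_out  # same shape as encoder_out

# A convolutional decoder (not modeled here) would map `filtered`
# back to the enhanced waveform or spectrogram.
```

The element-wise mask lets the network attenuate noise-dominated time-frequency regions of the encoded representation while passing speech-dominated regions through.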

While many competitive results have been achieved by networks that incorporate an attention mechanism such as the transformer, it is worth studying whether attention is indispensable for resolving the long-term dependency problem in speech enhancement, especially when considering the trade-off between computational efficiency and denoising performance. In the second contribution of this thesis, we propose an attention-free architecture based on multi-layer perceptrons (MLPs) for speech enhancement, named SE-Mixer, which consists of an encoder, a decoder, and multiple mixer blocks in between. The mixer block is designed to efficiently extract the contextual information of long-range speech sequences. It employs a temporal MLP in conjunction with convolution and a frequency MLP to iteratively capture multi-scale temporal information across various time scales and to extract frequency information within each time step. Our experimental results demonstrate that the proposed SE-Mixer achieves competitive performance with relatively few parameters compared to existing methods, despite incorporating no attention mechanism. We also show that an attention-free architecture can reach approximately the same effectiveness as attention-assisted models but with significantly lower computational complexity.
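The axis-wise mixing at the heart of such a mixer block can be illustrated with a minimal NumPy sketch. The layer sizes, ReLU MLPs, and residual connections below are hypothetical stand-ins for the learned temporal and frequency MLPs (the convolutional branch is omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy dimensions: T time frames x F frequency features (hypothetical).
T, F = 100, 64
x = rng.standard_normal((T, F))

def make_mlp(dim, hidden):
    """Weights for a toy two-layer MLP acting on vectors of size `dim`."""
    w1 = rng.standard_normal((dim, hidden)) * 0.1
    w2 = rng.standard_normal((hidden, dim)) * 0.1
    return w1, w2

def apply_mlp(v, weights):
    w1, w2 = weights
    return np.maximum(v @ w1, 0.0) @ w2  # ReLU between the two layers

temporal_mlp = make_mlp(T, 32)  # mixes information across time steps
freq_mlp = make_mlp(F, 32)      # mixes information within each time step

# Temporal mixing: transpose so the MLP acts along the time axis,
# transpose back, and add a residual connection.
h = x + apply_mlp(x.T, temporal_mlp).T
# Frequency mixing: MLP along the feature axis, again with a residual.
y = h + apply_mlp(h, freq_mlp)
```

Because each MLP mixes one axis at a time, every output element depends on the entire sequence after the two steps, giving a global receptive field without any attention computation.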

Divisions: Concordia University > Gina Cody School of Engineering and Computer Science > Electrical and Computer Engineering
Item Type: Thesis (Masters)
Authors: Wang, Kai
Institution: Concordia University
Degree Name: M.A.Sc.
Program: Electrical and Computer Engineering
Date: 25 July 2022
Thesis Supervisor(s): Zhu, Wei-Ping
ID Code: 990859
Deposited By: Kai Wang
Deposited On: 27 Oct 2022 14:27
Last Modified: 27 Oct 2022 14:27
All items in Spectrum are protected by copyright, with all rights reserved. The use of items is governed by Spectrum's terms of access.
