
Single channel speech enhancement based on U-Net


He, Bengbeng (2022) Single channel speech enhancement based on U-Net. Masters thesis, Concordia University.

He_MASc_F2022.pdf - Accepted Version
Available under License Spectrum Terms of Access.


Speech enhancement has found many applications in various fields involving speech processing. It aims to remove background noise from acquired speech signals so as to improve speech intelligibility and quality. In recent years, the development of deep learning approaches has significantly advanced speech enhancement by treating it as an estimation problem, with or without supervision. Many existing neural networks are based on the U-Net structure, where the encoder transforms the input speech into compressed features by removing noise information, and the decoder reconstructs the enhanced speech from these features through a symmetric structure. However, the contextual information of speech sequences, which is crucial for improving enhancement performance, cannot be fully captured due to the intrinsically local operation of the convolutions commonly used in U-Net. To improve the capability of U-Net in extracting long-term dependencies of speech sequences, this thesis investigates different attention-based U-Nets that capture the rich contextual information of long-range speech signals.

In the first contribution of this thesis, a dual-branch attention-assisted U-Net is proposed for single-channel speech enhancement, consisting of a dilated-dense encoder-decoder structure with a dual-branch attention mechanism in between. In the encoder, high-level speech features are obtained through multiple groups of a dilated-dense block followed by a down-sampling layer, where densely connected dilated convolutions enlarge the receptive field and aggregate features from previous layers. Next, the dual-branch attention applies spatial-wise and channel-wise attention in parallel to extract the spatial and channel information of the speech features, which are then averaged to form the contextual feature representation. The decoder, structured symmetrically to the encoder, transforms the features back into denoised speech through multiple pairs of a dilated-dense block and an up-sampling layer, with skip connections from the encoder layers used to boost feature reconstruction. Compared with other U-Net based methods, the proposed dual-branch attention-assisted U-Net achieves comparable evaluation metrics with considerably fewer trainable parameters.
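The dual-branch idea above — a channel-wise branch and a spatial-wise branch applied in parallel and then averaged — can be illustrated with a minimal NumPy sketch. This is a hypothetical simplification (mean-pooled sigmoid gates, no learned weights), not the thesis's actual layer; the feature-map shape (channels, time, frequency) is assumed.

```python
import numpy as np

def dual_branch_attention(x):
    """Illustrative dual-branch attention on a feature map x of shape
    (C, T, F): channels, time frames, frequency bins.
    Hypothetical sketch without learned parameters."""
    # Channel branch: squeeze the spatial dims, gate each channel.
    chan_scores = x.mean(axis=(1, 2))                # (C,)
    chan_gate = 1.0 / (1.0 + np.exp(-chan_scores))   # sigmoid gate per channel
    chan_out = x * chan_gate[:, None, None]
    # Spatial branch: squeeze the channel dim, gate each (t, f) position.
    spat_scores = x.mean(axis=0)                     # (T, F)
    spat_gate = 1.0 / (1.0 + np.exp(-spat_scores))   # sigmoid gate per position
    spat_out = x * spat_gate[None, :, :]
    # Average the two branches to form the contextual representation.
    return 0.5 * (chan_out + spat_out)

x = np.random.randn(16, 8, 4)   # toy feature map
y = dual_branch_attention(x)
assert y.shape == x.shape
```

In the actual network the gates would be produced by small learned sub-networks rather than plain pooled sigmoids; the sketch only shows how the two branches act on different axes of the same feature map before being averaged.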

To further improve the performance of the proposed attention-based U-Net, the second contribution of this thesis incorporates the multi-head attention (MHA) mechanism into U-Net to extract features from different representation subspaces. First, we propose a two-stage MHA block to replace the dual-branch attention block in the previously proposed U-Net structure, where a sample MHA and a frame MHA connected in tandem successively extract sample-level features within each individual frame and frame-level features across different frames, respectively, leading to better contextual speech features. However, the convolutional encoder-decoder used in the proposed U-Net still limits the model's ability to extract long-range dependencies of speech sequences because of the local operation performed by convolution. To overcome this drawback, we further replace the convolution-based encoder and decoder layers with the proposed two-stage MHA blocks to extract long-range relationships of speech sequences at the encoder-decoder level. Experimental results on a benchmark dataset show that the two proposed MHA-based U-Net models achieve competitive performance on all evaluation metrics while exhibiting much lower model complexity than other state-of-the-art networks.
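The two-stage arrangement — sample-level MHA within each frame, followed by frame-level MHA across frames — can be sketched as below. This is an illustrative NumPy toy (identity projections, fixed head count), not the thesis's implementation; the input layout (frames, samples per frame, feature dimension) is an assumption.

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mha(x, num_heads=2):
    """Scaled dot-product self-attention over the second-to-last axis
    of x (..., seq, d); identity Q/K/V projections for brevity."""
    *lead, seq, d = x.shape
    hd = d // num_heads
    h = x.reshape(*lead, seq, num_heads, hd).swapaxes(-2, -3)  # (..., heads, seq, hd)
    attn = softmax(h @ h.swapaxes(-1, -2) / np.sqrt(hd), axis=-1)
    out = attn @ h                                             # (..., heads, seq, hd)
    return out.swapaxes(-2, -3).reshape(*lead, seq, d)

def two_stage_mha(x):
    """x: (frames, samples, d). Stage 1 attends over samples within each
    frame; stage 2 attends over frames at each sample position."""
    x = mha(x)                    # sample-level MHA, per frame
    x = mha(x.swapaxes(0, 1))     # frame-level MHA, per sample position
    return x.swapaxes(0, 1)

x = np.random.randn(6, 10, 8)     # 6 frames, 10 samples each, d = 8
y = two_stage_mha(x)
assert y.shape == x.shape
```

Swapping the first two axes between the stages is what turns "attention within a frame" into "attention across frames" while reusing the same MHA routine — the tandem connection described above.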

Divisions:Concordia University > Gina Cody School of Engineering and Computer Science > Electrical and Computer Engineering
Item Type:Thesis (Masters)
Authors:He, Bengbeng
Institution:Concordia University
Degree Name:M.A.Sc.
Program:Electrical and Computer Engineering
Date:30 July 2022
Thesis Supervisor(s):Zhu, Wei-Ping
ID Code:990860
Deposited By: Bengbeng He
Deposited On:21 Jun 2023 14:34
Last Modified:21 Jun 2023 14:34
All items in Spectrum are protected by copyright, with all rights reserved. The use of items is governed by Spectrum's terms of access.



Research related to the current document (at the CORE website)