
Single-Channel Speech Enhancement Based on Deep Neural Networks



Ouyang, Zhiheng (2019) Single-Channel Speech Enhancement Based on Deep Neural Networks. Masters thesis, Concordia University.



Speech enhancement (SE) aims to improve the quality of degraded speech. Recently, researchers have turned to deep learning as a primary tool for speech enhancement, often with deterministic models trained in a supervised manner. Typically, a neural network is trained as a mapping function that converts features of noisy speech into targets from which clean speech can be reconstructed. Such neural-network-based speech enhancement methods have focused on estimating the spectral magnitude of clean speech, since estimating the spectral phase with neural networks is difficult due to the phase wrapping effect.
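The magnitude-only pipeline described above can be illustrated with a minimal NumPy sketch; this is not the thesis's actual system, and the placeholder scaling below merely stands in for a network's magnitude estimate. The key point is that only the magnitude is modified while the noisy phase is reused for reconstruction.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Frame the signal, window each frame, and take its FFT."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=-1)

def istft(spec, n_fft=512, hop=128):
    """Windowed overlap-add of the inverse FFT of each frame."""
    frames = np.fft.irfft(spec, n=n_fft, axis=-1)
    win = np.hanning(n_fft)
    x = np.zeros(hop * (len(frames) - 1) + n_fft)
    norm = np.zeros_like(x)
    for i, f in enumerate(frames):
        x[i * hop:i * hop + n_fft] += f * win
        norm[i * hop:i * hop + n_fft] += win ** 2
    return x / np.maximum(norm, 1e-8)

# Magnitude-only enhancement: estimate |S| but reuse the noisy phase.
noisy = np.random.randn(4000)
spec = stft(noisy)
mag, phase = np.abs(spec), np.angle(spec)
est_mag = mag * 0.8            # placeholder for a network's magnitude estimate
enhanced = istft(est_mag * np.exp(1j * phase))
```

Because the phase is left untouched, any error in the (noisy) phase carries over into the reconstructed waveform, which is exactly the limitation that motivates complex-spectrum estimation.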

As an alternative, complex spectrum estimation implicitly resolves the phase estimation problem and has been shown to outperform spectral magnitude estimation. In the first contribution of this thesis, a fully convolutional neural network (FCN) is proposed for complex spectrogram estimation. Stacked frequency-dilated convolutions are employed to obtain exponential growth of the receptive field along the frequency axis. The proposed network also admits an efficient implementation that requires far fewer parameters than a conventional deep neural network (DNN) or convolutional neural network (CNN) while yielding comparable performance.
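The exponential receptive-field growth of stacked dilated convolutions can be sketched as follows. The kernel size, dilation schedule, and helper names here are illustrative assumptions, not the thesis's architecture: with kernel size k and dilations d_1, ..., d_L, the receptive field is 1 + sum((k - 1) * d_l), so doubling the dilation at each layer grows it exponentially in depth.

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """'Valid' 1-D convolution with gaps of `dilation` between taps."""
    k = len(kernel)
    span = (k - 1) * dilation + 1
    out = np.empty(len(x) - span + 1)
    for i in range(len(out)):
        out[i] = sum(kernel[j] * x[i + j * dilation] for j in range(k))
    return out

def stacked_receptive_field(kernel_size, dilations):
    """Receptive field of stacked dilated convs: 1 + sum((k-1)*d)."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Dilations 1, 2, 4, 8 give a 31-sample receptive field from 4 layers
# of kernel size 3, versus only 9 samples without dilation.
dilations = [1, 2, 4, 8]
print(stacked_receptive_field(3, dilations))  # → 31
```

Chaining four such layers over a 31-sample input collapses it to a single output, confirming that one output value depends on all 31 inputs.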

Speech enhancement is only useful in noisy conditions, yet conventional SE methods do not adapt to varying noise levels. In the second contribution, we propose a model that provides an automatic "on/off" switch for speech enhancement: it scales its computational complexity across different signal-to-noise ratio (SNR) levels by detecting clean or near-clean speech, which requires no processing. By adapting the information maximizing generative adversarial network (InfoGAN) to a deterministic, supervised setting, we incorporate an SNR indicator into the model at little additional cost.
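The "on/off" switching idea can be sketched with an oracle SNR estimate and a hypothetical 20 dB threshold. In the thesis itself, the indicator is learned within the InfoGAN framework rather than computed from separated components; this sketch only shows the control flow that lets near-clean input skip the expensive enhancement network.

```python
import numpy as np

def estimate_snr_db(signal, noise):
    """Oracle SNR in dB, given separated signal and noise components."""
    ps = np.mean(signal ** 2)
    pn = np.mean(noise ** 2)
    return 10.0 * np.log10(ps / pn)

def maybe_enhance(noisy, snr_db, enhance, threshold_db=20.0):
    """Skip the (expensive) enhancement network when input is near-clean."""
    if snr_db >= threshold_db:
        return noisy            # "off": pass the input through unchanged
    return enhance(noisy)       # "on": run the full enhancement model
```

In the learned version, `snr_db` would be replaced by the model's own SNR indicator, so the gating decision costs little on top of the enhancement network itself.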

We evaluate the proposed SE methods against two objectives: speech intelligibility and applicability to automatic speech recognition (ASR). Experimental results show that the CNN-based model serves both objectives, while the InfoGAN-based model is more useful for speech intelligibility. The experiments also suggest that SE for ASR may be more challenging than improving speech intelligibility, as several factors, including the training dataset and the neural network models, can impact ASR performance.

Divisions: Concordia University > Gina Cody School of Engineering and Computer Science > Electrical and Computer Engineering
Item Type: Thesis (Masters)
Authors: Ouyang, Zhiheng
Institution: Concordia University
Degree Name: M.A.Sc.
Program: Electrical and Computer Engineering
Date: 20 December 2019
Thesis Supervisor(s): Zhu, Wei-Ping
ID Code: 986846
Deposited By: Zhiheng Ouyang
Deposited On: 25 Nov 2020 16:28
Last Modified: 25 Nov 2020 16:28
All items in Spectrum are protected by copyright, with all rights reserved. The use of items is governed by Spectrum's terms of access.


