
Resource efficient deep learning approaches for monaural speech separation

Shi, Peiran (2024) Resource efficient deep learning approaches for monaural speech separation. Masters thesis, Concordia University.

Text (application/pdf)
Shi_MASc_F2024.pdf - Accepted Version
Available under License Spectrum Terms of Access.
15MB

Abstract

Speech separation is a critical task in processing naturalistic audio streams, aiming to extract individual speech sources from mixed speech signals. Monaural speech separation, which deals with audio from a single microphone, focuses on isolating overlapping speech signals, a process essential for applications such as automatic speech recognition and voice assistant devices. Recent advances in deep learning have significantly improved speech separation, typically by training neural networks to estimate high-quality separated speech from mixed signals using supervised learning. However, most state-of-the-art neural networks operate in the time domain and are computationally expensive due to their sequential processing methods and complex structures. Despite the common perception that time-domain models outperform those in the time-frequency domain, this thesis focuses on developing resource-efficient models in the time-frequency domain, aiming to enhance their performance within a deep learning framework.

In the first contribution of this thesis, we propose RCFormer, a Conformer-based neural network with a redundancy approach, designed for monaural two-speaker speech separation. RCFormer employs multiple pairs of intra-frame and sub-band Conformer blocks to successively capture both frame-level and sub-band-level information from the input spectrogram. To address the challenge of sparse information in the input spectrogram, a redundancy approach is introduced to create a denser representation by stacking the input spectrogram embeddings. The proposed architecture places the Conformer blocks between a dilated dense convolutional encoder and decoder. The Conformer block outputs are fed into a masking module that generates masks to filter the encoder outputs, and the decoder then transforms the filtered outputs into separated speech signals. Extensive experiments demonstrate that RCFormer achieves competitive, and often superior, performance compared to existing state-of-the-art methods across all evaluation metrics, while also featuring significantly fewer trainable parameters.
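As a rough illustration of the data flow described in this paragraph, the following NumPy sketch alternates processing along the frequency axis (within each frame) and along the time axis (within each sub-band), then applies per-speaker masks to the encoder output. It is not the RCFormer implementation: the tensor layout `(frames, bins, channels)`, the function names, the mean-based `toy_sequence_block` standing in for a Conformer block, and the random-projection sigmoid masks are all assumptions made for illustration, and the redundancy (stacking) step, encoder, and decoder are omitted.

```python
import numpy as np

def toy_sequence_block(seq):
    # Hypothetical stand-in for a Conformer block: adds the sequence mean
    # (a crude global context) to every element of the sequence.
    return seq + seq.mean(axis=0, keepdims=True)

def intra_frame(features, seq_fn):
    # features has shape (T, F, C); each frame is a sequence over frequency.
    return np.stack([seq_fn(frame) for frame in features], axis=0)

def sub_band(features, seq_fn):
    # Transpose to (F, T, C) so each frequency bin is a sequence over time,
    # then transpose back to (T, F, C).
    bands = features.transpose(1, 0, 2)
    out = np.stack([seq_fn(band) for band in bands], axis=0)
    return out.transpose(1, 0, 2)

def estimate_masks(features, num_speakers=2):
    # Hypothetical masking module: a fixed random projection per speaker
    # followed by a sigmoid, giving masks with values in (0, 1).
    rng = np.random.default_rng(0)
    channels = features.shape[-1]
    weights = 0.1 * rng.standard_normal((num_speakers, channels, channels))
    logits = np.einsum("tfc,scd->stfd", features, weights)
    return 1.0 / (1.0 + np.exp(-logits))  # shape (S, T, F, C)

def separate(encoder_out, num_pairs=2):
    hidden = encoder_out
    for _ in range(num_pairs):
        hidden = intra_frame(hidden, toy_sequence_block)
        hidden = sub_band(hidden, toy_sequence_block)
    masks = estimate_masks(hidden)
    # Each mask filters the encoder output (not the block output).
    return masks * encoder_out[None]

# Toy encoder output: 10 frames, 16 frequency bins, 4 channels.
encoder_out = np.random.default_rng(1).standard_normal((10, 16, 4))
masked = separate(encoder_out)
print(masked.shape)  # (2, 10, 16, 4): one masked feature map per speaker
```

In a full model, each masked feature map would be passed to the decoder to produce one separated signal.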

While many models achieve competitive performance with fewer trainable parameters, few researchers have addressed the computational workload and processing time associated with these models. In the second contribution of this thesis, we propose FSBNet for two-speaker speech separation, which integrates sub-band and full-band modules. FSBNet consists of an encoder, multiple full-band and sub-band blocks (FSB blocks), and a decoder. The FSB block features a sub-band module that extracts temporal information within each sub-band and computes high-level cross-band dependencies through compact latent summaries, and a full-band module that captures long-range dependencies across the entire spectrogram using a self-attention mechanism. The contextual information obtained from the FSB blocks is then processed into two complex spectrograms representing the separated speech signals, which are re-synthesized into audio using the inverse short-time Fourier transform (ISTFT). Experimental results demonstrate that FSBNet achieves competitive performance compared to both time-domain and time-frequency domain approaches, with significant improvements in model size reduction and processing time efficiency. Notably, this architecture outperforms most efficient time-domain models for the first time since 2019.
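The final re-synthesis step mentioned in this paragraph, turning two complex spectrograms back into audio with the ISTFT, can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the FSBNet code: the frame length, hop size, periodic Hann window, and weighted overlap-add normalization are illustrative choices, and the "separated" spectrograms here are oracle STFTs of two synthetic tones standing in for network outputs.

```python
import numpy as np

FRAME_LEN = 64  # assumed frame length in samples
HOP = 16        # assumed hop size in samples

def hann_window(frame_len):
    # Periodic Hann window.
    return np.hanning(frame_len + 1)[:-1]

def stft(signal, frame_len=FRAME_LEN, hop=HOP):
    window = hann_window(frame_len)
    num_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(num_frames)
    ])
    return np.fft.rfft(frames, axis=1)  # shape (frames, frame_len // 2 + 1)

def istft(spec, frame_len=FRAME_LEN, hop=HOP):
    window = hann_window(frame_len)
    frames = np.fft.irfft(spec, n=frame_len, axis=1) * window
    length = hop * (spec.shape[0] - 1) + frame_len
    out = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(frames):
        start = i * hop
        out[start : start + frame_len] += frame        # overlap-add
        norm[start : start + frame_len] += window ** 2  # window energy
    return out / np.maximum(norm, 1e-8)

# Two synthetic "speakers" and their single-channel mixture.
t = np.arange(1024)
s1 = np.sin(2 * np.pi * 0.05 * t)
s2 = 0.5 * np.sin(2 * np.pi * 0.12 * t)
mixture = s1 + s2

# Oracle complex spectrograms stand in for the two network outputs.
spec1, spec2 = stft(s1), stft(s2)
est1, est2 = istft(spec1), istft(spec2)

# The STFT is linear, so the two spectrograms sum to the mixture spectrogram.
mix_error = np.max(np.abs(stft(mixture) - (spec1 + spec2)))
interior = slice(FRAME_LEN, -FRAME_LEN)
print("max reconstruction error:", np.max(np.abs(est1[interior] - s1[interior])))
```

The comparison is restricted to interior samples because the first and last frames are only partially covered by overlapping windows.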

Divisions: Concordia University > Gina Cody School of Engineering and Computer Science > Electrical and Computer Engineering
Item Type: Thesis (Masters)
Authors: Shi, Peiran
Institution: Concordia University
Degree Name: M.A. Sc.
Program: Electrical and Computer Engineering
Date: 15 August 2024
Thesis Supervisor(s): Zhu, Wei-Ping and Ravanelli, Mirco
ID Code: 994483
Deposited By: Peiran Shi
Deposited On: 24 Oct 2024 16:50
Last Modified: 24 Oct 2024 16:50
All items in Spectrum are protected by copyright, with all rights reserved. The use of items is governed by Spectrum's terms of access.
