Deep Learning Approaches for Speech Enhancement Toward Robust ASR

Shen, Xingyu ORCID: https://orcid.org/0009-0009-6581-6055 (2026) Deep Learning Approaches for Speech Enhancement Toward Robust ASR. PhD thesis, Concordia University.

Text (application/pdf)
Shen_PhD_S2026.pdf - Accepted Version
Available under License Spectrum Terms of Access.

7MB

Abstract

Speech enhancement (SE) aims to suppress noise and reverberation while preserving speech information, improving quality, intelligibility, and automatic speech recognition (ASR) robustness. In multichannel speech enhancement (MCSE), microphone arrays provide spatial diversity for filtering and dereverberation. Classical methods such as minimum variance distortionless response (MVDR) beamforming are principled and interpretable, but depend on reliable spatial statistics and array assumptions; their performance can degrade under covariance-estimation error, array perturbation, nonstationary interference, and deployment mismatch. Deep learning improves spectro-temporal modeling and learnable spatial processing, yet designing MCSE front ends that generalize across array geometries, model long context, and reliably benefit frozen ASR back ends remains difficult.
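For readers unfamiliar with the classical baseline named above, the following is a minimal NumPy sketch of textbook MVDR weights for a single frequency bin, w = R^{-1} d / (d^H R^{-1} d). It is not taken from the thesis. The function name mvdr_weights, the randomly generated noise covariance, and the random steering vector are all hypothetical choices made for illustration.

```python
import numpy as np

def mvdr_weights(noise_cov, steering):
    """Textbook MVDR weights w = R^{-1} d / (d^H R^{-1} d) for one frequency bin."""
    # Solve R x = d instead of forming the matrix inverse explicitly.
    r_inv_d = np.linalg.solve(noise_cov, steering)
    return r_inv_d / (steering.conj() @ r_inv_d)

rng = np.random.default_rng(0)
num_mics = 4

# Hypothetical noise covariance: Hermitian positive definite by construction.
a = rng.standard_normal((num_mics, num_mics)) + 1j * rng.standard_normal((num_mics, num_mics))
noise_cov = a @ a.conj().T + 1e-3 * np.eye(num_mics)

# Hypothetical steering vector made of unit-modulus phase terms.
steering = np.exp(1j * rng.uniform(0, 2 * np.pi, num_mics))

w = mvdr_weights(noise_cov, steering)

# Distortionless constraint: the response toward the target, w^H d, should be 1.
distortionless_response = w.conj() @ steering
```

The weights depend directly on the estimated covariance and the assumed steering vector, which is why errors in either one (the estimation error and array perturbation mentioned above) carry straight through to the beamformer output.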

This thesis develops short-time Fourier transform (STFT)-domain deep-learning approaches for robust and deployable speech enhancement. Four contributions are presented. First, topology-robust spatial front ends represent microphones as graph nodes and learn inter-channel interactions in the complex STFT domain. The work progresses from complex-valued graph convolution to graph-attention-based convex spatial combining with real, nonnegative, sum-to-one channel weights. A covariance-free inference variant uses MVDR-inspired teacher distributions and learnable temperature scaling to reduce sensitivity to covariance estimation.
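The convex spatial combining described above can be illustrated with a small NumPy sketch. It assumes, purely for illustration, that per-channel scores are turned into weights by a temperature-scaled softmax over the channel axis. The abstract does not say how the weights are computed, so the function convex_combine, the random scores (standing in for a network output), and the tensor shapes are hypothetical.

```python
import numpy as np

def convex_combine(stft, scores, temperature=1.0):
    """Combine channels with real, nonnegative, sum-to-one weights.

    stft:   complex array of shape (channels, freq, time)
    scores: real array of the same shape (hypothetical network output)
    """
    scaled = scores / temperature
    # Subtract the per-bin maximum for numerical stability of the softmax.
    scaled = scaled - scaled.max(axis=0, keepdims=True)
    weights = np.exp(scaled)
    weights /= weights.sum(axis=0, keepdims=True)
    # Weighted sum over the channel axis gives a single-channel spectrum.
    combined = (weights * stft).sum(axis=0)
    return combined, weights

rng = np.random.default_rng(0)
channels, freq_bins, frames = 4, 5, 6
stft = rng.standard_normal((channels, freq_bins, frames)) \
    + 1j * rng.standard_normal((channels, freq_bins, frames))
scores = rng.standard_normal((channels, freq_bins, frames))

combined, weights = convex_combine(stft, scores, temperature=0.5)
```

Because the weights form a convex combination, the output magnitude in each time-frequency bin cannot exceed the largest channel magnitude in that bin. A lower temperature pushes the weights toward selecting a single channel, and a higher one pushes them toward uniform averaging.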

Second, physics-informed and lightweight spatial-filtering architectures improve robustness and efficiency. These include compact dynamic spatial filtering with residual spectral mapping, an end-to-end MVDR-inspired framework with physics-inspired regularization and residual refinement, and a non-learned multi-band relative contrastive loss that aligns training with perceptual band structure without increasing inference cost.

Third, a dual-path state-space framework with cross-domain interaction improves long-context modeling in noisy and reverberant conditions. It captures short- and long-range dependencies at practical computational cost while coordinating magnitude restoration with complex-spectrum refinement.

Finally, this thesis studies recognition-compatible ASR front ends under frozen recognizers. A Parallel Time-Band Mixer (PTBM) models intra-band temporal context and cross-band structure without within-block recurrence, while learned observation fusion (LOF) adaptively combines noisy and enhanced complex spectra to reduce ASR-sensitive artifacts without development-set coefficient tuning.
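One simple way to picture fusing noisy and enhanced complex spectra is a per-bin gate between the two. The NumPy sketch below assumes a sigmoid gate applied to logits. This is an illustrative assumption and not the LOF formulation from the thesis, which the abstract does not specify. The function fuse_observations and all inputs are hypothetical.

```python
import numpy as np

def fuse_observations(noisy, enhanced, gate_logits):
    """Blend noisy and enhanced complex spectra with a per-bin gate in (0, 1)."""
    # The sigmoid maps unconstrained logits to gate values between 0 and 1.
    gate = 1.0 / (1.0 + np.exp(-gate_logits))
    # gate near 1 trusts the enhanced spectrum, gate near 0 keeps the noisy input.
    fused = gate * enhanced + (1.0 - gate) * noisy
    return fused, gate

rng = np.random.default_rng(0)
shape = (257, 100)  # hypothetical (frequency bins, frames)
noisy = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
enhanced = 0.5 * noisy            # stand-in for an enhancement model output
gate_logits = np.zeros(shape)     # stand-in for learned logits; sigmoid(0) = 0.5

fused, gate = fuse_observations(noisy, enhanced, gate_logits)
```

In a sketch like this the gate is predicted per input rather than fixed, which is one way to avoid tuning a single global mixing coefficient on a development set.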

Experiments on simulated and real noisy and reverberant benchmarks show consistent improvements in speech quality, intelligibility, robustness to geometry variation and array perturbation, and favorable computational efficiency. With frozen recognizers, the proposed front ends also improve downstream recognition accuracy.

Divisions:	Concordia University > Gina Cody School of Engineering and Computer Science > Electrical and Computer Engineering
Item Type:	Thesis (PhD)
Authors:	Shen, Xingyu
Institution:	Concordia University
Degree Name:	Ph. D.
Program:	Electrical and Computer Engineering
Date:	25 April 2026
Thesis Supervisor(s):	Zhu, Wei-Ping
ID Code:	997141
Deposited By:	Xingyu Shen
Deposited On:	29 Jun 2026 17:36
Last Modified:	29 Jun 2026 17:36
