Efficient Distributed Training With Subnetworks

Khalid, Zafir (2026) Efficient Distributed Training With Subnetworks. Masters thesis, Concordia University.

Text (application/pdf)
Khalid_MCompSc_F2026.pdf - Accepted Version
Available under License Spectrum Terms of Access.

1MB

Abstract

Pre-training large neural networks at scale places significant memory and communication demands on modern accelerators. As model sizes continue to grow faster than available device memory, efficiently distributing training across multiple hardware devices has become increasingly challenging. In practice, this challenge is shaped not only by theoretical scaling limits but also by system-level constraints such as network bandwidth, parameter redundancy, and computational overhead.

This thesis investigates a novel distributed training approach termed Model Parallelism with Subnetwork Data Parallelism, which introduces Subnetwork Data Parallelism (SDP) as a memory-efficient alternative to classical distributed data parallel (DDP) training. SDP partitions a model into structured, end-to-end subnetworks that are trained independently across workers without exchanging activations, thereby reducing per-device memory usage and communication costs.
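
As a rough illustration of how per-worker subnetworks might be combined, the sketch below averages each parameter's gradient over only those workers whose subnetwork contains it. The aggregation rule, the function name `aggregate_subnetwork_grads`, and the toy masks and gradients are all assumptions made for illustration; the abstract does not specify how SDP synchronizes workers.

```python
import numpy as np

def aggregate_subnetwork_grads(grads, masks):
    """Hypothetical aggregation: average each parameter's gradient over
    the workers whose binary mask covers that parameter."""
    grads = np.asarray(grads, dtype=float)   # shape: (workers, params)
    masks = np.asarray(masks, dtype=float)   # same shape, entries in {0, 1}
    total = (grads * masks).sum(axis=0)      # ignore gradients outside each subnetwork
    coverage = masks.sum(axis=0)             # number of workers holding each parameter
    # Parameters covered by no worker receive a zero update.
    return np.divide(total, coverage, out=np.zeros_like(total), where=coverage > 0)

# Toy example: 2 workers, 4 parameters (values are made up).
masks = np.array([[1, 1, 0, 0],
                  [0, 1, 1, 0]])
grads = np.array([[2.0, 4.0, 9.0, 9.0],
                  [9.0, 6.0, 8.0, 9.0]])
agg = aggregate_subnetwork_grads(grads, masks)
```

In this toy setup each worker only needs to store and communicate the entries selected by its own mask, which is the source of the memory and communication savings described above.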

The work examines two complementary masking regimes within the SDP framework. Backward masking applies sparsity exclusively during gradient computation, preserving unbiased gradient estimates and providing a theoretically grounded baseline. Forward masking extends sparsity to the forward pass, enabling further reductions in memory and computational cost while introducing additional regularization. In addition, two structured subnetwork construction strategies, neuron-level and block-level masking, are explored across both convolutional neural networks and transformer-based architectures.
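
The difference between the two regimes can be sketched on a toy two-layer network with a neuron-level mask over the hidden units. This is a minimal NumPy illustration under assumed details (a ReLU hidden layer, a squared-error loss, and the hypothetical function `two_layer_grads`), not the thesis implementation.

```python
import numpy as np

def two_layer_grads(W1, w2, x, y, mask, regime):
    """Gradient of 0.5 * (out - y)**2 w.r.t. W1 for one worker's neuron mask.

    regime="backward": dense forward pass; gradient kept only for masked-in neurons.
    regime="forward":  masked-out neurons are also zeroed in the forward pass.
    """
    pre = W1 @ x                      # hidden pre-activations
    h = np.maximum(pre, 0.0)          # ReLU
    if regime == "forward":
        h = h * mask                  # drop masked-out neurons from the forward pass
    out = w2 @ h
    err = out - y
    dpre = err * w2 * (pre > 0)       # backpropagate through output layer and ReLU
    dpre = dpre * mask                # both regimes: only the subnetwork gets gradient
    return out, np.outer(dpre, x)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))
w2 = rng.normal(size=4)
x = rng.normal(size=3)
y = 1.0
mask = np.array([1.0, 1.0, 0.0, 0.0])   # this worker keeps hidden neurons 0 and 1

full_out, full_grad = two_layer_grads(W1, w2, x, y, np.ones(4), "backward")
bwd_out, bwd_grad = two_layer_grads(W1, w2, x, y, mask, "backward")
fwd_out, fwd_grad = two_layer_grads(W1, w2, x, y, mask, "forward")
```

Under backward masking the output equals the dense output, and the gradient rows for kept neurons match the dense gradient exactly. Under forward masking the prediction itself is computed by the subnetwork, so the error signal, and hence the gradient for kept neurons, can differ from the dense case.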

The proposed approach is evaluated through extensive experiments on image classification benchmarks, including CIFAR-10 and CIFAR-100, as well as large-scale language model pre-training on the FineWeb dataset. The results demonstrate that Subnetwork Data Parallelism achieves substantial memory savings while maintaining, and in some cases improving, performance relative to standard data-parallel training. These findings highlight its practicality for training large models under constrained memory budgets.

Divisions:	Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering
Item Type:	Thesis (Masters)
Authors:	Khalid, Zafir
Institution:	Concordia University
Degree Name:	M. Comp. Sc.
Program:	Computer Science
Date:	3 April 2026
Thesis Supervisor(s):	Belilovsky, Eugene
ID Code:	997154
Deposited By:	Zafir Khalid
Deposited On:	29 Jun 2026 14:56
Last Modified:	29 Jun 2026 14:56
