Khalid, Zafir (2026) Efficient Distributed Training With Subnetworks. Masters thesis, Concordia University.
Text (application/pdf): Khalid_MCompSc_F2026.pdf (1MB) - Accepted Version. Available under License Spectrum Terms of Access.
Abstract
Pre-training large neural networks at scale places significant memory and communication demands on modern accelerators. As model sizes continue to grow faster than available device memory, efficiently distributing training across multiple hardware devices has become increasingly challenging. In practice, this challenge is shaped not only by theoretical scaling limits but also by system-level constraints such as network bandwidth, parameter redundancy, and computational overhead.
This thesis investigates a novel distributed training approach termed Model Parallelism with Subnetwork Data Parallelism, which introduces Subnetwork Data Parallelism (SDP) as a memory-efficient alternative to classical distributed data parallel (DDP) training. SDP partitions a model into structured, end-to-end subnetworks that are trained independently across workers without exchanging activations, thereby reducing per-device memory usage and communication costs.
The work examines two complementary masking regimes within the SDP framework. Backward masking applies sparsity exclusively during gradient computation, preserving unbiased gradient estimates and providing a theoretically grounded baseline. Forward masking extends sparsity to the forward pass, enabling further reductions in memory and computational cost while introducing additional regularization. In addition, two structured subnetwork construction strategies, neuron-level and block-level masking, are explored across both convolutional neural networks and transformer-based architectures.
The proposed approach is evaluated through extensive experiments on image classification benchmarks, including CIFAR-10 and CIFAR-100, as well as large-scale language model pre-training on the FineWeb dataset. The results demonstrate that Subnetwork Data Parallelism achieves substantial memory savings while maintaining, and in some cases improving, performance relative to standard data-parallel training. These findings highlight its practicality for training large models under constrained memory budgets.
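
As a rough illustration of the two masking regimes named in the abstract, the NumPy sketch below applies a neuron-level mask to a single linear layer. The layer, the squared-error loss, the mask layout, and the function names `backward_masked_step` and `forward_masked_step` are all assumptions made for illustration; the abstract does not describe the thesis's actual implementation, and this is not it.

```python
import numpy as np


def backward_masked_step(x, W, mask, target):
    """Hypothetical backward masking: full forward pass, masked gradient.

    x: (batch, d_in), W: (d_in, d_out), mask: (d_out,) with entries 0 or 1.
    """
    y = x @ W                                  # forward pass uses every neuron
    grad_y = (y - target) / x.shape[0]         # dL/dy for L = 0.5 * mean_b ||y_b - t_b||^2
    grad_W = x.T @ grad_y                      # dense gradient w.r.t. W
    return y, grad_W * mask                    # keep only this worker's neurons (columns)


def forward_masked_step(x, W, mask, target):
    """Hypothetical forward masking: masked neurons are dropped in the forward pass too."""
    y = x @ (W * mask)                         # masked output neurons produce exactly zero
    grad_y = (y - target) / x.shape[0]
    grad_W = (x.T @ grad_y) * mask             # chain rule through W * mask
    return y, grad_W


# Illustrative usage with made-up shapes: one worker owning neurons 0, 2 and 4.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 5))
target = rng.normal(size=(4, 5))
mask = np.array([1.0, 0.0, 1.0, 0.0, 1.0])

y_bwd, g_bwd = backward_masked_step(x, W, mask, target)
y_fwd, g_fwd = forward_masked_step(x, W, mask, target)
```

In both variants the gradient columns of the masked neurons are zero, so a worker only needs to store and communicate updates for its own subnetwork. The variants differ in the forward pass: the backward-masked version computes the full output, while the forward-masked version also zeroes the masked neurons' outputs.
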
| Divisions: | Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering |
|---|---|
| Item Type: | Thesis (Masters) |
| Authors: | Khalid, Zafir |
| Institution: | Concordia University |
| Degree Name: | M. Comp. Sc. |
| Program: | Computer Science |
| Date: | 3 April 2026 |
| Thesis Supervisor(s): | Belilovsky, Eugene |
| ID Code: | 997154 |
| Deposited By: | Zafir Khalid |
| Deposited On: | 29 Jun 2026 14:56 |
| Last Modified: | 29 Jun 2026 14:56 |