GPU resource allocation in Clouds

Sedighi, Hoda ORCID: https://orcid.org/0009-0006-6382-0796 (2026) GPU resource allocation in Clouds. PhD thesis, Concordia University.

Text (application/pdf)
Sedighi_PhD_S2026.pdf - Accepted Version
Available under License Spectrum Terms of Access.

4MB

Abstract

Cloud computing enables on-demand access to a shared pool of configurable computing resources, including Graphics Processing Units (GPUs), which are essential for accelerating compute-intensive workloads such as artificial intelligence (AI), machine learning (ML), and microservice-based applications. As GPU adoption grows in modern cloud environments, the diversity of workloads, heterogeneous resource requirements, and strict isolation demands make efficient GPU resource allocation a critical challenge. Inefficient scheduling and static allocation policies often result in GPU underutilization, performance interference, prolonged task completion times, and degraded Quality of Service (QoS).
Modern cloud workloads require dynamic and fine-grained GPU resource management to satisfy fairness, performance isolation, and latency constraints. In a multi-tenant cloud environment, GPU allocation mechanisms must enforce fairness and strong isolation to prevent interference across workloads while maintaining high utilization. Moreover, cloud-native applications, such as microservices, consist of loosely coupled, interdependent components that exhibit diverse GPU demands and dynamic execution behaviours. Inefficient resource sharing in such applications can lead to performance bottlenecks and increased communication and data-transfer overhead. Furthermore, real-time cloud services, particularly latency-sensitive AI inference workloads, require priority-based GPU allocation and scheduling to meet deadline requirements. Together, these requirements make GPU resource allocation a multi-dimensional challenge. At the cluster level, the problem is to ensure fairness and isolation among tenants. At the application level, the focus is on maximizing efficiency for cloud-native applications. Lastly, at the runtime level, the challenge lies in executing tasks with priority awareness under real-time latency constraints.
This thesis addresses these challenges with three key contributions for multi-tenant cloud environments, where efficient, fair, and latency-aware GPU allocation is critical. First, we propose a fairness-driven GPU allocation mechanism that enforces strong isolation among tenants while maximizing GPU utilization in shared cloud infrastructures. Second, we introduce a dynamic GPU resource allocation framework designed for microservice-based applications. This framework adapts to workload variations and inter-component dependencies to improve throughput and reduce end-to-end latency. Third, we present a priority-based GPU scheduling strategy that supports task preemption and resumption, enabling the timely execution of real-time workloads while preserving fairness.

Divisions:	Concordia University > Gina Cody School of Engineering and Computer Science > Concordia Institute for Information Systems Engineering
Item Type:	Thesis (PhD)
Authors:	Sedighi, Hoda
Institution:	Concordia University
Degree Name:	Ph. D.
Program:	Information and Systems Engineering
Date:	13 March 2026
Thesis Supervisor(s):	Glitho, Roch
ID Code:	997179
Deposited By:	Hoda Sedighi
Deposited On:	29 Jun 2026 17:54
Last Modified:	29 Jun 2026 17:54
