Yelluru Gopal, Goutam (2024) Exploring Convex Optimization and Transformer based Methods for Efficient Visual Object Tracking. PhD thesis, Concordia University.
Text (application/pdf): YelluruGopal_PhD_S2024.pdf (22MB) - Accepted Version. Available under License Spectrum Terms of Access.
Abstract
The effectiveness of a visual object tracking algorithm heavily relies on how well it represents the target object through a collection of feature templates, or channels. However, some channels lose their discriminative power under challenging video conditions such as target deformation, occlusion, and motion blur, leading to tracking failures. Discriminative Correlation Filter-based (DCF) trackers address these challenges by aggregating hand-crafted and deep Convolutional Neural Network-based (CNN) channels. However, this approach increases the computational complexity of the tracker and significantly reduces inference speed, especially on constrained hardware such as a Central Processing Unit (CPU) or edge devices. We observe a parallel trend in end-to-end trainable deep Siamese Network-based (SN) trackers, which deploy parameter-heavy backbones for feature extraction and rely on specialized hardware such as a Graphics Processing Unit (GPU) for faster inference. In this thesis, we propose computationally efficient solutions for both DCF and SN tracking algorithms while improving their accuracy.
For multi-channel DCF tracking, we present three solutions to alleviate the impact of non-discriminative features (or channels). These methods leverage the concept of reliability to quantify the discriminative power of a feature (or a channel) based on its filter response. The proposed solutions dynamically lower the weights of unreliable features (or channels) while emphasizing the temporal smoothness of the learned weights. We formulate the process of learning adaptive weights as a convex optimization problem and derive efficient solutions to maintain tracking speed.
Expanding on the lightweight SN tracking paradigm, our first algorithm, MVT, employs a cascaded arrangement of CNN and transformer blocks in its backbone. This approach fuses template and search regions during feature extraction to generate superior feature encoding for target localization. Our second tracking algorithm, a Separable Self and Mixed Attention Transformer-based tracker (SMAT), further increases the efficiency of MVT by replacing the standard attention with a computationally efficient separable attention block. The proposed trackers exhibit superior performance on eight challenging benchmarks compared to related lightweight trackers, with SMAT emerging as the top performer. The computationally efficient architecture enables our MVT and SMAT trackers to run at real-time speed on a CPU, while achieving 150 frames per second on a GPU.
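As a rough illustration of the adaptive channel weighting described in the abstract for DCF tracking, the sketch below poses one possible convex problem. It is not the formulation used in the thesis. It assumes a per-channel reliability score is already available (how it is computed from the filter response is not specified here) and chooses new weights w on the probability simplex by minimizing -r·w + (lam/2)·||w - w_prev||^2. The first term favors reliable channels and the second penalizes abrupt changes from the previous frame. This objective has a closed-form solution: the Euclidean projection of w_prev + r/lam onto the simplex. The names `project_to_simplex`, `update_channel_weights`, and `lam` are hypothetical.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) == 1}.

    Standard sort-based algorithm: find the threshold theta such that
    clipping (v - theta) at zero yields entries that sum to one.
    """
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        candidate = (cumulative - 1.0) / j
        # The condition holds for a prefix of the sorted values;
        # the last candidate that satisfies it is the threshold.
        if u_j - candidate > 0:
            theta = candidate
    return [max(x - theta, 0.0) for x in v]


def update_channel_weights(prev_weights, reliability, lam=4.0):
    """Solve min_w -r.w + (lam/2)*||w - w_prev||^2 over the simplex.

    Assumed, illustrative objective (not taken from the thesis).
    Completing the square shows the minimizer is the projection of
    (w_prev + r / lam) onto the simplex. Larger lam means smoother
    weights over time.
    """
    target = [w + r / lam for w, r in zip(prev_weights, reliability)]
    return project_to_simplex(target)


if __name__ == "__main__":
    prev = [0.25, 0.25, 0.25, 0.25]      # weights from the previous frame
    scores = [0.9, 0.8, 0.1, 0.7]        # hypothetical reliability scores
    print(update_channel_weights(prev, scores, lam=4.0))
```

Because the update is a single sort plus a linear pass, its cost is negligible next to filter learning, which is in the spirit of the abstract's emphasis on keeping tracking speed. The parameter `lam` trades responsiveness against temporal smoothness.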
Divisions: | Concordia University > Gina Cody School of Engineering and Computer Science > Electrical and Computer Engineering |
---|---|
Item Type: | Thesis (PhD) |
Authors: | Yelluru Gopal, Goutam |
Institution: | Concordia University |
Degree Name: | Ph. D. |
Program: | Electrical and Computer Engineering |
Date: | 27 February 2024 |
Thesis Supervisor(s): | Amer, Maria |
ID Code: | 993729 |
Deposited By: | Goutam Yelluru Gopal |
Deposited On: | 05 Jun 2024 15:27 |
Last Modified: | 05 Jun 2024 15:27 |