Zunair, Hasib (2024) Learning Contextual Vision Representations via Masking. PhD thesis, Concordia University.
Text (application/pdf), 17MB: Zunair_PhD_S2025.pdf - Accepted Version. Available under License Spectrum Terms of Access.
Abstract
Supervised learning on large-scale labeled datasets has been critical to the success of computer vision, with widespread applications in robotics, healthcare, security, sports, and retail. To address challenges of over-dependence on labeled data, self-supervised learning aims to learn from data without annotations. However, new problems arise, such as the difficulty in defining appropriate pretext tasks, increased computational demands from multiple stages, and the need for large amounts of unlabeled data. In this thesis, we introduce a learning paradigm that models global and local context for semantic segmentation. The proposed method effectively captures pixel relationships, improving performance in ambiguous regions and better segmenting minority classes through masking. We show that our approach achieves better performance than state-of-the-art single and multi-task learning baselines in both binary and multi-class semantic segmentation tasks, particularly in tackling small, ambiguous regions in medical images and minority class instances in cluttered scenes. Motivated by the intuition that occluded objects are partial inputs, we propose a single-stage, model-agnostic approach for multi-label image recognition. The proposed method learns contextualized representations using a masked branch and models label co-occurrence through label consistency. Experimental results demonstrate the simplicity, applicability, and, more importantly, the competitive performance of our approach against previous state-of-the-art methods, especially in identifying small and occluded objects. Additionally, we propose an efficient unsupervised object localization method that can segment unfamiliar objects in images without the need for additional training, particularly when they are small, reflective, or poorly illuminated. 
The proposed method learns context-based representations at both the pixel- and shape-level using only a single learnable convolutional layer decoder and a frozen encoder. We demonstrate on six benchmark datasets the simplicity, efficiency, and competitive performance of our approach in both single object discovery and unsupervised salient object detection, outperforming existing methods that require intensive computational resources, extensive training, and large data volumes.
Divisions: | Concordia University > Gina Cody School of Engineering and Computer Science > Concordia Institute for Information Systems Engineering |
---|---|
Item Type: | Thesis (PhD) |
Authors: | Zunair, Hasib |
Institution: | Concordia University |
Degree Name: | Ph. D. |
Program: | Information and Systems Engineering |
Date: | 18 December 2024 |
Thesis Supervisor(s): | Hamza, A. Ben |
Keywords: | computer vision, efficient deep learning, machine learning |
ID Code: | 994916 |
Deposited By: | Md Hasib Zunair |
Deposited On: | 17 Jun 2025 15:01 |
Last Modified: | 17 Jun 2025 15:01 |