Learning Contextual Vision Representations via Masking

Zunair, Hasib (2024) Learning Contextual Vision Representations via Masking. PhD thesis, Concordia University.

Text (application/pdf)
Zunair_PhD_S2025.pdf - Accepted Version
Available under License Spectrum Terms of Access.

17MB

Abstract

Supervised learning on large-scale labeled datasets has been critical to the success of computer vision, with widespread applications in robotics, healthcare, security, sports, and retail. To address the over-dependence on labeled data, self-supervised learning aims to learn from data without annotations. However, new problems arise, such as the difficulty in defining appropriate pretext tasks, increased computational demands from multiple stages, and the need for large amounts of unlabeled data.

In this thesis, we introduce a learning paradigm that models global and local context for semantic segmentation. The proposed method effectively captures pixel relationships, improving performance in ambiguous regions and better segmenting minority classes through masking. We show that our approach achieves better performance than state-of-the-art single-task and multi-task learning baselines in both binary and multi-class semantic segmentation tasks, particularly in tackling small, ambiguous regions in medical images and minority class instances in cluttered scenes.

Motivated by the intuition that occluded objects are partial inputs, we propose a single-stage, model-agnostic approach for multi-label image recognition. The proposed method learns contextualized representations using a masked branch and models label co-occurrence through label consistency. Experimental results demonstrate the simplicity, applicability, and, more importantly, the competitive performance of our approach against previous state-of-the-art methods, especially in identifying small and occluded objects.

Additionally, we propose an efficient unsupervised object localization method that can segment unfamiliar objects in images without the need for additional training, particularly when they are small, reflective, or poorly illuminated. The proposed method learns context-based representations at both the pixel and shape levels using only a single learnable convolutional layer decoder and a frozen encoder. We demonstrate on six benchmark datasets the simplicity, efficiency, and competitive performance of our approach in both single object discovery and unsupervised salient object detection, outperforming existing methods that require intensive computational resources, extensive training, and large data volumes.

Divisions:	Concordia University > Gina Cody School of Engineering and Computer Science > Concordia Institute for Information Systems Engineering
Item Type:	Thesis (PhD)
Authors:	Zunair, Hasib
Institution:	Concordia University
Degree Name:	Ph. D.
Program:	Information and Systems Engineering
Date:	18 December 2024
Thesis Supervisor(s):	Hamza, A. Ben
Keywords:	computer vision, efficient deep learning, machine learning
ID Code:	994916
Deposited By:	Md Hasib Zunair
Deposited On:	17 Jun 2025 15:01
Last Modified:	17 Jun 2025 15:01
