Fine-Tuning CLIP for Security Object Classification and Detection in X-ray Images

Datta, Shamita (2025) Fine-Tuning CLIP for Security Object Classification and Detection in X-ray Images. Masters thesis, Concordia University.

Text (application/pdf)
Datta_MSc_F2025.pdf - Accepted Version
Restricted to Repository staff only until 1 November 2027.
Available under License Spectrum Terms of Access.
3MB

Abstract

Security X-ray imaging is a vital tool for detecting threats and ensuring public safety. In modern security systems, computer vision and deep learning have driven major advances, particularly in object classification and detection. However, the diversity of threat objects and the limited availability of high-quality labeled X-ray data pose persistent challenges for conventional detectors. Emerging vision–language models, such as CLIP, provide a promising direction by enabling few-shot classification and detection, reducing dependence on large annotated datasets.
CLIP’s strength lies in its zero-shot capability: it classifies images using textual labels as classifiers, without task-specific fine-tuning. It associates image and text features through shared representations, but it performs suboptimally on security X-rays because the domain-specific patterns of such images are absent from its pre-training data. To overcome this limitation, we explore few-shot adaptation techniques that allow CLIP to specialize in the X-ray domain with minimal supervision, leveraging its pre-trained visual–textual foundations while introducing lightweight domain-specific fine-tuning.
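
As a rough illustration of how textual labels can act as classifiers, the following minimal sketch scores one image embedding against one text embedding per label using cosine similarity and a softmax. It is not the thesis implementation: the three-dimensional vectors are toy stand-ins for real CLIP encoder outputs, the label names are hypothetical, and the `logit_scale` value of 100 is an assumed temperature.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Return one probability per label from scaled cosine similarities."""
    img = l2_normalize(image_emb)
    logits = []
    for text_emb in text_embs:
        txt = l2_normalize(text_emb)
        logits.append(logit_scale * sum(a * b for a, b in zip(img, txt)))
    # Numerically stable softmax over the label logits.
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical labels; in practice each would be embedded from a text prompt.
labels = ["knife", "gun", "scissors"]
# Toy embeddings (assumption): stand-ins for text and image encoder outputs.
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_emb = [0.2, 0.9, 0.1]

probs = zero_shot_probs(image_emb, text_embs)
predicted = labels[probs.index(max(probs))]
```

No weights are trained in this procedure, which is why its accuracy depends entirely on how well the pre-trained embeddings already separate the target classes.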
This thesis investigates CLIP-based adaptation strategies for both classification and detection in X-ray imagery. For detection, CLIP is integrated with the region proposal network of Faster R-CNN: the network localizes prohibited items, and a fine-tuned CLIP then assigns a label to each localized region.
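
The two-stage idea of localizing first and labeling second can be sketched as follows. This is a simplified illustration under stated assumptions, not the pipeline from the thesis: `label_proposals`, `toy_classifier`, the box coordinates, and all scores are hypothetical, and multiplying the proposal's objectness by the class probability is only one possible way to form a detection score.

```python
def label_proposals(proposals, classify, score_threshold=0.5):
    """Assign a class label to each proposed box and keep confident detections.

    proposals: list of (box, objectness) pairs from a region proposal stage.
    classify:  callable mapping a box to (label, class_probability).
    """
    detections = []
    for box, objectness in proposals:
        label, prob = classify(box)
        # Assumed scoring rule: combine localization and classification confidence.
        score = objectness * prob
        if score >= score_threshold:
            detections.append({"box": box, "label": label, "score": score})
    # Highest-scoring detections first.
    return sorted(detections, key=lambda d: d["score"], reverse=True)

# Toy proposals (x1, y1, x2, y2) with objectness scores; values are made up.
proposals = [
    ((10, 10, 50, 50), 0.9),
    ((60, 20, 90, 80), 0.8),
    ((0, 0, 5, 5), 0.3),
]

# Stand-in for a fine-tuned classifier applied to each cropped region.
toy_outputs = {
    (10, 10, 50, 50): ("knife", 0.95),
    (60, 20, 90, 80): ("gun", 0.70),
    (0, 0, 5, 5): ("scissors", 0.90),
}

def toy_classifier(box):
    return toy_outputs[box]

detections = label_proposals(proposals, toy_classifier)
```

Separating the two stages this way means the classifier can be adapted to the X-ray domain without retraining the localization stage.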
Subsequent chapters present three adaptation strategies in a few-shot setting: adapter-based fine-tuning, full model fine-tuning, and LoRA-based fine-tuning. Through systematic evaluation on benchmark X-ray datasets, we demonstrate how CLIP’s vision–language pretraining can be effectively adapted to specialized security data, achieving strong performance even with limited samples, and we compare these methods using classification accuracy for classification and average precision for detection.
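
For readers unfamiliar with LoRA, the sketch below shows the general low-rank update on a single weight matrix: the frozen weight `W` is combined with a trainable product `B @ A` scaled by `alpha / r`. It illustrates the standard formulation only; the matrix sizes, values, and helper names are hypothetical, and nothing here reflects which layers or ranks the thesis uses.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def lora_weight(w, a, b, alpha):
    """Return W + (alpha / r) * (B @ A), where r is the LoRA rank.

    w: frozen weight, shape (d_out, d_in)
    a: trainable matrix, shape (r, d_in)
    b: trainable matrix, shape (d_out, r)
    """
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]

# Toy frozen weight with d_out = 2 and d_in = 3 (values are illustrative).
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
a = [[1.0, 2.0, 3.0]]  # rank r = 1

# B starts at zero, so the adapted weight initially equals the frozen weight.
b_init = [[0.0], [0.0]]
w_start = lora_weight(w, a, b_init, alpha=2.0)

# After some hypothetical training step, B is no longer zero.
b_trained = [[1.0], [0.0]]
w_adapted = lora_weight(w, a, b_trained, alpha=2.0)

# Trainable parameters: r * (d_in + d_out), versus d_in * d_out for full tuning.
lora_params = len(a) * (len(w[0]) + len(w))
full_params = len(w) * len(w[0])
```

On matrices this small the saving is negligible (5 versus 6 parameters), but the count `r * (d_in + d_out)` grows far more slowly than `d_in * d_out` at realistic layer sizes, which is what makes the approach attractive when only a few labeled samples are available.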

Divisions: Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering
Item Type: Thesis (Masters)
Authors: Datta, Shamita
Institution: Concordia University
Degree Name: M. Sc.
Program: Computer Science
Date: 3 November 2025
Thesis Supervisor(s): Wang, Yang and Zuo, Xinxin
ID Code: 996418
Deposited By: Shamita Datta
Deposited On: 29 Jun 2026 14:55
Last Modified: 29 Jun 2026 14:55
All items in Spectrum are protected by copyright, with all rights reserved. The use of items is governed by Spectrum's terms of access.
