Robustness and Safety in AI Systems: Adaptive Quantile Recalibration for Test-Time Adaptation and Mechanistic Interpretability of LLM Jailbreaking

Mehrbod, Paria (2025) Robustness and Safety in AI Systems: Adaptive Quantile Recalibration for Test-Time Adaptation and Mechanistic Interpretability of LLM Jailbreaking. Masters thesis, Concordia University.

Text (application/pdf)
PariaMehrbod_MA_S2026.pdf - Accepted Version
Available under License Spectrum Terms of Access.

3MB

Abstract

Ensuring the stability and safety of AI systems during deployment is a major challenge in machine learning. Distribution shifts in data and adversarial attacks can degrade model performance and undermine the reliability of AI applications. This thesis focuses on two themes: improving model robustness under distribution shift at inference time, and interpreting the sources of vulnerabilities in large language models.
The first contribution addresses test-time adaptation for image classifiers. In real-world settings, models often face data that differ from the training distribution, leading to performance degradation. To address this, we introduce Adaptive Quantile Recalibration (AQR), a non-parametric method that adjusts the model's internal feature distributions to better match those computed on source (training) data. Specifically, AQR leverages pre-computed quantile estimates from the source distribution and recalibrates features extracted from incoming test samples so that their distribution aligns with the source feature distribution. This adaptation occurs entirely at inference time, requiring no gradient updates or model retraining.
The method is architecture-agnostic and applies to both convolutional neural networks and Vision Transformers. In experiments on standard robustness benchmarks, including CIFAR-10/100-C and ImageNet-C, AQR consistently reduces classification error compared to unadapted models and competitive baseline methods.
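As a rough illustration of the quantile recalibration idea described above, the following NumPy sketch maps each test feature to the source value at the same quantile level. It is not the thesis implementation: the function names (`fit_source_quantiles`, `recalibrate`), the per-channel treatment, the number of quantile levels, the use of batch-level test quantiles, and the linear interpolation between quantiles are all assumptions made for this example.

```python
import numpy as np

def fit_source_quantiles(source_feats, n_quantiles=101):
    """Pre-compute per-channel quantiles of source features.

    source_feats has shape (n_samples, n_channels); the returned quantile
    table has shape (n_quantiles, n_channels). Hypothetical helper.
    """
    levels = np.linspace(0.0, 1.0, n_quantiles)
    return levels, np.quantile(source_feats, levels, axis=0)

def recalibrate(test_feats, levels, source_q):
    """Map each test feature to the source value at the same quantile level."""
    # Quantiles of the incoming test batch, estimated per channel (assumption).
    test_q = np.quantile(test_feats, levels, axis=0)
    out = np.empty_like(test_feats, dtype=float)
    for c in range(test_feats.shape[1]):
        # Piecewise-linear map from test quantiles to source quantiles.
        out[:, c] = np.interp(test_feats[:, c], test_q[:, c], source_q[:, c])
    return out

# Synthetic stand-ins for features: source ~ N(0, 1), shifted test ~ N(2, 3).
rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=(5000, 4))
shifted = rng.normal(2.0, 3.0, size=(512, 4))

levels, source_q = fit_source_quantiles(source)
adapted = recalibrate(shifted, levels, source_q)
```

In this sketch no gradients or parameter updates are involved: only the stored source quantile table and the statistics of the current test batch are used, which mirrors the inference-only character of the method as the abstract describes it.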
The second contribution investigates safety vulnerabilities in large language models (LLMs) through mechanistic interpretability. Even with safety training in place, LLMs remain vulnerable to jailbreaking attacks that elicit harmful outputs. This thesis explores circuit discovery techniques to identify specific subnetworks that enable jailbreaking behavior. To this end, we scale two circuit-discovery methods, edge attribution patching and subnetwork probing, to LLaMA-2-7B-chat and show that circuits as sparse as ~5% of model edges can be found and attributed to this behavior. Ablating the circuit, by zeroing activations at its edges during the forward pass that generates the first token, reduces jailbreaking attack success by ~30% on harmful prompts.
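To make the notion of zero-ablating circuit edges concrete, here is a toy NumPy sketch. It does not reproduce the thesis experiments or LLaMA-2-7B-chat: the `forward` function, the three-node residual-style graph, and the chosen edge set are hypothetical, and are meant only to show what it means to zero the contribution that flows along selected edges during a forward pass.

```python
import numpy as np

def forward(x, weights, ablated_edges=frozenset()):
    """Toy residual-style graph: node i reads the sum of all earlier outputs.

    weights[i] is the (d, d) matrix of node i. ablated_edges holds
    (src, dst) pairs whose contribution to dst's input is replaced by zeros.
    Source index -1 denotes the input embedding. Hypothetical helper.
    """
    outputs = {-1: x}
    for dst, w in enumerate(weights):
        node_in = np.zeros_like(x)
        for src, out in outputs.items():
            # Zero-ablation: skip (i.e. zero out) the activation on this edge.
            if (src, dst) not in ablated_edges:
                node_in += out
        outputs[dst] = np.tanh(node_in @ w)
    # Readout sums every node output; edges into the readout are left intact.
    return sum(outputs.values())

rng = np.random.default_rng(0)
d = 8
weights = [rng.normal(scale=0.5, size=(d, d)) for _ in range(3)]
x = rng.normal(size=(1, d))

# Hypothetical "circuit": the edges from nodes 0 and 1 into node 2.
circuit = {(0, 2), (1, 2)}

full = forward(x, weights)
ablated = forward(x, weights, ablated_edges=circuit)
effect = float(np.abs(full - ablated).max())  # size of the ablation effect
```

In a real transformer the nodes would be components such as attention heads and MLP blocks, and the effect of the ablation would be measured on a behavioral metric such as attack success rather than on raw output differences.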
Together, these contributions advance the understanding of robustness and safety in AI systems, providing practical methods for improving model reliability during deployment and interpretable insights into vulnerability patterns in large language models.

Divisions:	Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering
Item Type:	Thesis (Masters)
Authors:	Mehrbod, Paria
Institution:	Concordia University
Degree Name:	M. Comp. Sc.
Program:	Computer Science
Date:	29 October 2025
Thesis Supervisor(s):	Belilovsky, Eugene and Wolf, Guy
Keywords:	test-time adaptation, domain adaptation, domain shift, test-time distribution shift, mechanistic interpretability, Jailbreaking
ID Code:	996489
Deposited By:	Paria Mehrbod
Deposited On:	29 Jun 2026 14:57
Last Modified:	29 Jun 2026 14:57

Spectrum Research Repository
