Mehrbod, Paria (2025) Robustness and Safety in AI Systems: Adaptive Quantile Recalibration for Test-Time Adaptation and Mechanistic Interpretability of LLM Jailbreaking. Masters thesis, Concordia University.
Text (application/pdf), 3MB: PariaMehrbod_MA_S2026.pdf - Accepted Version. Available under License Spectrum Terms of Access.
Abstract
Ensuring the stability and safety of AI systems during deployment is a major challenge in machine learning. Distribution shifts in data and adversarial attacks can degrade model performance and undermine the reliability of AI applications. This thesis focuses on two themes: improving model robustness to distribution shifts at inference time and interpreting the sources of vulnerabilities in large language models.
The first contribution addresses test-time adaptation for image classifiers. In real-world settings, models often face data that differ from the training distribution, leading to performance degradation. To address this, we introduce Adaptive Quantile Recalibration (AQR), a non-parametric method that adjusts the model’s internal feature distributions to better match those computed on source (training) data. Specifically, AQR uses quantile estimates pre-computed from the source distribution and recalibrates the features extracted from incoming test samples so that their distribution aligns with the source feature distribution. This adaptation occurs entirely at inference time, requiring no gradient updates or model retraining.
The method is architecture-agnostic and applies to both convolutional neural networks and Vision Transformers. In experiments on standard robustness benchmarks, including CIFAR-10-C, CIFAR-100-C, and ImageNet-C, AQR consistently reduces classification error compared to unadapted models and competitive baseline methods.
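The quantile-matching idea described above can be illustrated with a minimal NumPy sketch. This is not the thesis implementation: the function names (`fit_source_quantiles`, `recalibrate`), the per-channel treatment, the number of quantile levels, and the piecewise-linear interpolation are all assumptions made for illustration.

```python
import numpy as np

def fit_source_quantiles(source_feats, num_quantiles=101):
    """Pre-compute per-channel quantiles from source features of shape (N, C).

    Hypothetical helper: the number of quantile levels is an assumption.
    """
    levels = np.linspace(0.0, 1.0, num_quantiles)
    source_q = np.quantile(source_feats, levels, axis=0)  # shape (Q, C)
    return levels, source_q

def recalibrate(test_feats, levels, source_q):
    """Map each channel of a test batch (B, C) onto the source quantiles.

    No gradients or parameter updates are involved; the map is a
    piecewise-linear function from test quantiles to source quantiles.
    """
    test_q = np.quantile(test_feats, levels, axis=0)  # shape (Q, C)
    out = np.empty_like(test_feats, dtype=np.float64)
    for c in range(test_feats.shape[1]):
        out[:, c] = np.interp(test_feats[:, c], test_q[:, c], source_q[:, c])
    return out

# Usage: source features ~ N(0, 1); test features are shifted and rescaled.
rng = np.random.default_rng(0)
source = rng.standard_normal((5000, 4))
test = 3.0 * rng.standard_normal((2000, 4)) + 5.0

levels, source_q = fit_source_quantiles(source)
adapted = recalibrate(test, levels, source_q)
```

In this toy setting the adapted features should have roughly the mean and spread of the source features, even though the raw test batch was shifted and rescaled.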
The second contribution investigates safety vulnerabilities in large language models through mechanistic interpretability. Even with safety training in place, LLMs remain vulnerable to jailbreaking attacks that elicit harmful outputs. This thesis explores circuit discovery techniques to identify specific subnetworks that play a role in enabling jailbreaking behavior. To this end, we scale circuit-discovery methods (edge attribution patching and subnetwork probing) to LLaMA-2-7B-chat and show that circuits comprising as few as ~5% of the model's edges can be found and attributed to this behavior. Ablating the circuit, by zeroing activations at its edges during the forward pass for first-token generation, reduces jailbreaking attack success by ~30% on harmful prompts.
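Zero-ablation of edges can be pictured with a toy computational graph. The sketch below is only an illustration of the general idea and makes several assumptions: a three-node residual-style graph stands in for the language model, the `forward` function and the `edge_mask` matrix are hypothetical, and the ablated edge is chosen arbitrarily rather than discovered by edge attribution patching or subnetwork probing.

```python
import numpy as np

def forward(x, weights, edge_mask):
    """Toy residual-style graph with nodes in topological order.

    Node j reads the input x plus the outputs of earlier nodes i < j,
    each scaled by edge_mask[i, j]. Setting edge_mask[i, j] = 0 zeroes
    the activation carried along edge i -> j (zero-ablation).
    """
    outputs = []
    for j, W in enumerate(weights):
        node_in = x.copy()
        for i, out_i in enumerate(outputs):
            node_in += edge_mask[i, j] * out_i
        outputs.append(np.tanh(W @ node_in))
    return x + sum(outputs)

rng = np.random.default_rng(0)
dim, num_nodes = 8, 3
x = rng.standard_normal(dim)
weights = [rng.standard_normal((dim, dim)) for _ in range(num_nodes)]

full_mask = np.ones((num_nodes, num_nodes))
ablated_mask = full_mask.copy()
ablated_mask[0, 2] = 0.0  # hypothetical "circuit" edge: node 0 -> node 2

y_full = forward(x, weights, full_mask)
y_ablated = forward(x, weights, ablated_mask)
```

Comparing `y_full` with `y_ablated` shows how much the output depends on the ablated edge; in the thesis setting the analogous comparison is the change in attack success once the circuit's edges are zeroed.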
Together, these contributions advance the understanding of robustness and safety in AI systems, providing practical methods for improving model reliability during deployment and interpretable insights into vulnerability patterns in large language models.
| Divisions: | Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering |
|---|---|
| Item Type: | Thesis (Masters) |
| Authors: | Mehrbod, Paria |
| Institution: | Concordia University |
| Degree Name: | M. Comp. Sc. |
| Program: | Computer Science |
| Date: | 29 October 2025 |
| Thesis Supervisor(s): | Belilovsky, Eugene and Wolf, Guy |
| Keywords: | test-time adaptation, domain adaptation, domain shift, test-time distribution shift, mechanistic interpretability, Jailbreaking |
| ID Code: | 996489 |
| Deposited By: | Paria Mehrbod |
| Deposited On: | 29 Jun 2026 14:57 |
| Last Modified: | 29 Jun 2026 14:57 |