Nasiri Sarvi, Ali (2025) Efficient and Interpretable Representations: From Medical Representation Learning to Vision-Language Multimodal Representation Engineering. Masters thesis, Concordia University.
Text (application/pdf): NasiriSarvi_MSc_F2025.pdf (67 MB) - Accepted Version. Available under License: Spectrum Terms of Access.
Abstract
Visual representation learning has achieved remarkable progress on natural image benchmarks, but faces critical challenges when deployed in specialized domains like medical imaging. This thesis addresses two interconnected problems: developing efficient architectures that maintain performance while aligning with domain expertise, and creating scalable frameworks for understanding what foundation models learn across different architectures.
We first investigate Vision Mamba architectures for medical applications. For histopathology, we adapt Vision Mamba within the DINO self-supervised learning framework, achieving an 8.21 AUC point improvement over Vision Transformers with comparable parameters on lymph node metastasis detection. Explainability analysis reveals that Vision Mamba focuses on diagnostically relevant cellular features, suggesting better alignment with clinical workflows. For breast ultrasound classification, we demonstrate through transfer learning that Mamba-based architectures achieve statistically significant improvements, with comprehensive analysis showing they are never significantly outperformed by traditional CNN or Vision Transformer baselines.
Our interpretability analysis of pathology foundation models using sparse autoencoders reveals a fundamental scalability problem: each model produces an incompatible latent space that requires separate expert analysis, so interpretability effort scales exponentially as foundation models proliferate. To address this limitation, we develop SPARC, a unified framework that enables interpretability analysis across multiple models simultaneously. SPARC introduces a Global TopK mechanism ensuring that identical latent dimensions activate across models, and a cross-reconstruction loss enforcing semantic consistency. Our evaluation demonstrates substantial improvements, achieving 84.4% of neurons active across all streams compared to 43.6% with traditional approaches, and enabling new capabilities such as text-guided spatial attention in vision-only models.
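The two SPARC ingredients named above can be sketched numerically. The following is a minimal, hypothetical illustration (not the thesis implementation): two model streams of different widths are encoded into one shared sparse latent space, a Global TopK step keeps the same k latent dimensions active in every stream, and the loss combines per-stream reconstruction with a cross-reconstruction term that decodes one stream's input from the other stream's code. All names, dimensions, and the pooled-magnitude selection rule are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def global_topk(latents, k):
    """Keep the same k latent dimensions active in every stream.

    latents: dict of stream name -> (d_latent,) pre-activation vector.
    The shared support is chosen from magnitudes pooled across streams
    (an assumed selection rule), so all streams activate identical dims.
    """
    stacked = np.stack(list(latents.values()))   # (n_streams, d_latent)
    scores = np.abs(stacked).sum(axis=0)         # pooled importance per dim
    keep = np.argsort(scores)[-k:]               # indices of shared support
    mask = np.zeros(stacked.shape[1])
    mask[keep] = 1.0
    return {name: z * mask for name, z in latents.items()}, keep

# Hypothetical setup: two foundation-model streams with different
# embedding widths, each with its own encoder/decoder into a shared
# d_latent-dimensional sparse code.
d_a, d_b, d_latent, k = 16, 24, 64, 8
W_enc = {"model_a": rng.normal(size=(d_a, d_latent)),
         "model_b": rng.normal(size=(d_b, d_latent))}
W_dec = {"model_a": rng.normal(size=(d_latent, d_a)),
         "model_b": rng.normal(size=(d_latent, d_b))}

x = {"model_a": rng.normal(size=d_a), "model_b": rng.normal(size=d_b)}
z = {name: x[name] @ W_enc[name] for name in x}  # per-stream encoding
z_sparse, support = global_topk(z, k)

# Reconstruction: each stream decodes its own sparse code.
recon = sum(np.mean((z_sparse[n] @ W_dec[n] - x[n]) ** 2) for n in x)
# Cross-reconstruction: decode stream n's input from the OTHER stream's
# code, pushing the shared dimensions toward one consistent semantics.
cross = sum(np.mean((z_sparse[other] @ W_dec[n] - x[n]) ** 2)
            for n in x for other in x if other != n)
loss = recon + cross
```

Because the mask is computed once from the pooled scores, every stream's nonzero latent indices fall inside the same k-dimensional support, which is what makes features comparable across models.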
This work contributes efficient architectures for medical applications, identifies fundamental limitations in current interpretability paradigms, and provides a scalable solution that transforms cross-model interpretability from an exponentially scaling manual process into a systematic, unified approach. The results have implications for both medical AI deployment and broader interpretability research as foundation models continue to proliferate across specialized domains.
| Field | Value |
|---|---|
| Divisions: | Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering |
| Item Type: | Thesis (Masters) |
| Authors: | Nasiri Sarvi, Ali |
| Institution: | Concordia University |
| Degree Name: | M. Comp. Sc. |
| Program: | Computer Science |
| Date: | 18 July 2025 |
| Thesis Supervisor(s): | Hosseini, Mahdi S. and Rivaz, Hassan |
| ID Code: | 995930 |
| Deposited By: | Ali Nasiri Sarvi |
| Deposited On: | 04 Nov 2025 15:40 |
| Last Modified: | 04 Nov 2025 15:40 |