Efficient and Interpretable Representations: From Medical Representation Learning to Vision-Language Multimodal Representation Engineering

Nasiri Sarvi, Ali (2025) Efficient and Interpretable Representations: From Medical Representation Learning to Vision-Language Multimodal Representation Engineering. Masters thesis, Concordia University.

Text (application/pdf): NasiriSarvi_MSc_F2025.pdf - Accepted Version (67MB). Available under License Spectrum Terms of Access.

Abstract

Visual representation learning has achieved remarkable progress on natural image benchmarks, but faces critical challenges when deployed in specialized domains like medical imaging. This thesis addresses two interconnected problems: developing efficient architectures that maintain performance while aligning with domain expertise, and creating scalable frameworks for understanding what foundation models learn across different architectures.

We first investigate Vision Mamba architectures for medical applications. For histopathology, we adapt Vision Mamba within the DINO self-supervised learning framework, achieving an 8.21 AUC-point improvement over Vision Transformers with comparable parameter counts on lymph node metastasis detection. Explainability analysis reveals that Vision Mamba focuses on diagnostically relevant cellular features, suggesting better alignment with clinical workflows. For breast ultrasound classification, we demonstrate through transfer learning that Mamba-based architectures achieve statistically significant improvements, with a comprehensive analysis showing that they are never significantly outperformed by traditional CNN or Vision Transformer baselines.
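To make the self-distillation objective concrete, the snippet below sketches DINO's core loss in PyTorch. It is a minimal illustration under assumed defaults (temperatures, output width, and linear stand-ins where the Vision Mamba student and its EMA teacher would go), not the configuration used in the thesis.

    import torch
    import torch.nn.functional as F

    def dino_loss(student_out, teacher_out, center, t_student=0.1, t_teacher=0.04):
        # Teacher targets are centered and sharpened; no gradient flows through them.
        targets = F.softmax((teacher_out - center) / t_teacher, dim=-1).detach()
        log_probs = F.log_softmax(student_out / t_student, dim=-1)
        return -(targets * log_probs).sum(dim=-1).mean()

    # Linear stand-ins for the student backbone and its EMA teacher copy; in the
    # histopathology setting, a Vision Mamba encoder plus projection head replaces them.
    student = torch.nn.Linear(196, 256)
    teacher = torch.nn.Linear(196, 256)
    center = torch.zeros(256)  # in full DINO, a running mean of teacher outputs

    # Two augmented views of the same batch; the loss is applied across views so
    # the student matches the teacher's assignments on a different crop.
    v1, v2 = torch.randn(8, 196), torch.randn(8, 196)
    loss = dino_loss(student(v1), teacher(v2), center) + dino_loss(student(v2), teacher(v1), center)
    loss.backward()

In full DINO, the teacher's weights are an exponential moving average of the student's, and the center is updated from teacher outputs to prevent collapse; both updates are omitted here for brevity.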

Our interpretability analysis of pathology foundation models using sparse autoencoders reveals a fundamental scalability problem: each model produces an incompatible latent space that requires separate expert analysis, so interpretability effort scales exponentially as foundation models proliferate. To address this limitation, we develop SPARC, a unified framework that enables interpretability analysis across multiple models simultaneously. SPARC introduces a Global TopK mechanism, which ensures that identical latent dimensions activate across models, and a cross-reconstruction loss, which enforces semantic consistency between them. Our evaluation demonstrates substantial improvements: 84.4% of neurons are active across all streams, compared to 43.6% with traditional approaches, and the framework enables new capabilities such as text-guided spatial attention in vision-only models.
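The sketch below illustrates the two mechanisms named above on two hypothetical model streams: a Global TopK step that selects the same k latent dimensions for every stream, and a cross-reconstruction term that decodes each stream from the other streams' codes. Class names, dimensions, and hyperparameters are illustrative assumptions, not the SPARC implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GlobalTopKSAE(nn.Module):
        def __init__(self, dims, latent_dim=4096, k=64):
            super().__init__()
            # One encoder/decoder pair per model stream, sharing one latent space.
            self.encoders = nn.ModuleList(nn.Linear(d, latent_dim) for d in dims)
            self.decoders = nn.ModuleList(nn.Linear(latent_dim, d) for d in dims)
            self.k = k

        def forward(self, feats):
            # feats: list of (batch, d_i) activations, one tensor per stream.
            z = [F.relu(enc(x)) for enc, x in zip(self.encoders, feats)]
            # Global TopK: rank latent dimensions by their summed activation
            # across streams, then keep the SAME k dimensions in every stream.
            scores = torch.stack(z).sum(dim=0)             # (batch, latent_dim)
            idx = scores.topk(self.k, dim=-1).indices
            mask = torch.zeros_like(scores).scatter_(-1, idx, 1.0)
            return [zi * mask for zi in z]

    def sparc_style_loss(model, feats):
        z = model(feats)
        loss = 0.0
        for i, dec in enumerate(model.decoders):
            loss = loss + F.mse_loss(dec(z[i]), feats[i])  # self-reconstruction
            for j in range(len(z)):
                if j != i:
                    # Cross-reconstruction: decoding stream i from stream j's
                    # code pushes shared latents toward consistent semantics.
                    loss = loss + F.mse_loss(dec(z[j]), feats[i])
        return loss

    model = GlobalTopKSAE(dims=[768, 1024])  # e.g. a ViT stream and a Mamba stream
    feats = [torch.randn(8, 768), torch.randn(8, 1024)]
    sparc_style_loss(model, feats).backward()

Because every stream is masked to an identical active set, codes from different models live on a common support and can be compared dimension by dimension, which is what allows analysis done once on the shared dictionary to transfer across models.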

This work contributes efficient architectures for medical applications, identifies fundamental limitations in current interpretability paradigms, and provides a scalable solution that transforms cross-model interpretability from an exponentially scaling manual process into a systematic, unified approach. The results have implications for both medical AI deployment and broader interpretability research as foundation models continue to proliferate across specialized domains.

Divisions: Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering
Item Type: Thesis (Masters)
Authors: Nasiri Sarvi, Ali
Institution: Concordia University
Degree Name: M. Comp. Sc.
Program: Computer Science
Date: 18 July 2025
Thesis Supervisor(s): Hosseini, Mahdi S. and Rivaz, Hassan
ID Code: 995930
Deposited By: Ali Nasiri Sarvi
Deposited On: 04 Nov 2025 15:40
Last Modified: 04 Nov 2025 15:40
All items in Spectrum are protected by copyright, with all rights reserved. The use of items is governed by Spectrum's terms of access.



Research related to the current document (at the CORE website)