Abid, Md Sayeed
ORCID: https://orcid.org/0000-0001-9770-4490
(2025)
Tramba: A Hybrid Architecture for Table Understanding.
Masters thesis, Concordia University.
Text (application/pdf): Abid_MCompSc_F2025.pdf (1MB) - Accepted Version. Available under License Spectrum Terms of Access.
Abstract
The increasing complexity and density of document images—particularly in scientific and industrial contexts—have posed significant challenges for traditional transformer-based models, due to their quadratic attention complexity and reliance on extensive computational resources. In response, this thesis proposes a novel hybrid vision architecture that integrates the Vision Mamba encoder with the Detection Transformer (DETR) framework to address the tasks of table detection and structure recognition. Leveraging Mamba’s state space modeling, which reduces computational complexity from O(N^2) to O(N), the proposed architecture retains competitive representational power while improving scalability and training efficiency. Vision Mamba is a state space sequence model designed for vision tasks, offering linear-time computation and efficient long-range dependency modeling through a bidirectional convolutional structure. DETR, in contrast, is an end-to-end object detection framework that formulates detection as a direct set prediction problem using a transformer-based encoder-decoder and learnable object queries. In our hybrid model, we replace DETR’s standard transformer encoder with a Mamba-based encoder stack, preserving the core object query mechanism while enabling lightweight and efficient sequential processing. Through extensive experiments on the PubTables-1M dataset, which is one of the largest datasets for table extraction tasks, we demonstrate that our model outperforms Faster R-CNN on both detection and structure recognition tasks, and approaches the performance of full DETR models—despite using only one-third of the encoder-decoder layers and fewer training epochs. These results highlight the architecture’s efficiency and adaptability, offering strong performance under constrained training budgets. Beyond empirical gains, the modular design of the model facilitates extensibility, including integration with large language models (LLMs) for advanced multimodal tasks such as document question answering, layout-based information retrieval, and regulatory content parsing. Finally, the lightweight nature of the Mamba encoder makes the model well-suited for deployment in enterprise-scale document processing systems, where throughput and latency are critical. This thesis thus introduces a promising direction for rethinking vision transformers through hardware-efficient sequence modeling, contributing meaningfully to the advancement of document AI and structured visual understanding. The source code is available at: github.com/SayeedAbid/Tramba
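The hybrid design described in the abstract can be illustrated with a short PyTorch sketch. This is not the thesis implementation (see the linked repository for that): the `BiSSMBlock` below is a hypothetical stand-in for a bidirectional Vision Mamba block, using depthwise convolutions in place of the selective state-space scan, and the layer counts, query count, and class count are illustrative assumptions. The sketch only shows the structural idea of pairing a linear-time encoder stack with a DETR-style decoder and learnable object queries.

```python
# Minimal sketch (not the thesis code): a DETR-style detector whose transformer
# encoder is replaced by a stack of bidirectional, linear-time mixing blocks.
import torch
import torch.nn as nn


class BiSSMBlock(nn.Module):
    """Hypothetical stand-in for a bidirectional Mamba-style block.

    Depthwise 1-D convolutions approximate the forward/backward sequence
    mixing here; a real Vision Mamba block would use a selective SSM scan.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.fwd = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1, groups=d_model)
        self.bwd = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1, groups=d_model)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x):                        # x: (B, N, d_model)
        h = self.norm(x).transpose(1, 2)          # (B, d_model, N)
        mixed = self.fwd(h) + self.bwd(h.flip(-1)).flip(-1)
        return x + self.proj(mixed.transpose(1, 2))


class HybridDetector(nn.Module):
    """DETR-style set prediction with an SSM-like encoder and transformer decoder."""

    def __init__(self, d_model=256, num_queries=100, num_classes=2,
                 enc_layers=2, dec_layers=2):
        super().__init__()
        self.encoder = nn.Sequential(*[BiSSMBlock(d_model) for _ in range(enc_layers)])
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=dec_layers)
        self.queries = nn.Embedding(num_queries, d_model)       # learnable object queries
        self.class_head = nn.Linear(d_model, num_classes + 1)   # +1 for "no object"
        self.box_head = nn.Linear(d_model, 4)                   # (cx, cy, w, h), normalized

    def forward(self, tokens):                   # tokens: (B, N, d_model) patch features
        memory = self.encoder(tokens)
        q = self.queries.weight.unsqueeze(0).repeat(tokens.size(0), 1, 1)
        hs = self.decoder(q, memory)
        return self.class_head(hs), self.box_head(hs).sigmoid()


# Example: a batch of 2 images, each flattened into 196 patch tokens of width 256.
logits, boxes = HybridDetector()(torch.randn(2, 196, 256))
```

In this sketch the decoder, object queries, and prediction heads follow the standard DETR set-prediction recipe, while the encoder is the swappable component, which is where the linear-time sequence modeling claimed in the abstract would live.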
| Field | Value |
|---|---|
| Divisions: | Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering |
| Item Type: | Thesis (Masters) |
| Authors: | Abid, Md Sayeed |
| Institution: | Concordia University |
| Degree Name: | M. Comp. Sc. |
| Program: | Computer Science |
| Date: | 20 July 2025 |
| Thesis Supervisor(s): | Wang, Dr. Yang and Suen, Dr. Ching Yee |
| ID Code: | 995860 |
| Deposited By: | Md Sayeed Abid |
| Deposited On: | 04 Nov 2025 15:34 |
| Last Modified: | 05 Nov 2025 01:00 |

