Tramba: A Hybrid Architecture for Table Understanding

Abid, Md Sayeed ORCID: https://orcid.org/0000-0001-9770-4490 (2025) Tramba: A Hybrid Architecture for Table Understanding. Masters thesis, Concordia University.

Abid_MCompSc_F2025.pdf - Accepted Version (PDF, 1MB)
Available under License Spectrum Terms of Access.

Abstract

The increasing complexity and density of document images, particularly in scientific and industrial contexts, pose significant challenges for traditional transformer-based models due to their quadratic attention complexity and reliance on extensive computational resources. In response, this thesis proposes a novel hybrid vision architecture that integrates the Vision Mamba encoder with the Detection Transformer (DETR) framework to address the tasks of table detection and table structure recognition. By leveraging Mamba's state space modeling, which reduces computational complexity from O(N^2) to O(N), the proposed architecture retains competitive representational power while improving scalability and training efficiency.

Vision Mamba is a state space sequence model designed for vision tasks, offering linear-time computation and efficient long-range dependency modeling through a bidirectional convolutional structure. DETR, in contrast, is an end-to-end object detection framework that formulates detection as a direct set prediction problem using a transformer-based encoder-decoder and learnable object queries. In our hybrid model, we replace DETR's standard transformer encoder with a Mamba-based encoder stack, preserving the core object query mechanism while enabling lightweight and efficient sequential processing.

Through extensive experiments on PubTables-1M, one of the largest datasets for table extraction tasks, we demonstrate that our model outperforms Faster R-CNN on both detection and structure recognition, and approaches the performance of full DETR models despite using only one-third of the encoder-decoder layers and fewer training epochs. These results highlight the architecture's efficiency and adaptability, offering strong performance under constrained training budgets. Beyond these empirical gains, the model's modular design facilitates extensibility, including integration with large language models (LLMs) for advanced multimodal tasks such as document question answering, layout-based information retrieval, and regulatory content parsing. Finally, the lightweight nature of the Mamba encoder makes the model well suited for deployment in enterprise-scale document processing systems, where throughput and latency are critical.

This thesis introduces a promising direction for rethinking vision transformers through hardware-efficient sequence modeling, contributing to the advancement of document AI and structured visual understanding. The source code is available at: github.com/SayeedAbid/Tramba
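To make the encoder swap concrete, below is a minimal, self-contained PyTorch sketch of the idea the abstract describes: DETR's transformer encoder is replaced by a stack of linear-time bidirectional sequence blocks, while the learnable object queries and the transformer decoder are kept. The SimpleSSMBlock here is a simplified stand-in for a real Vision Mamba block (which uses selective state-space scans and gating), and all module names, layer counts, and hyperparameters are illustrative assumptions, not the thesis's actual implementation.

```python
import torch
import torch.nn as nn

class SimpleSSMBlock(nn.Module):
    """Toy linear-time sequence mixer standing in for a Vision Mamba block.

    A real Mamba block uses selective state-space scans; this stand-in runs
    a depthwise 1-D convolution in both directions, keeping the O(N) cost
    and bidirectionality but none of Mamba's selective gating."""
    def __init__(self, d_model: int, d_conv: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.fwd = nn.Conv1d(d_model, d_model, d_conv, padding=d_conv - 1, groups=d_model)
        self.bwd = nn.Conv1d(d_model, d_model, d_conv, padding=d_conv - 1, groups=d_model)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x):                       # x: (B, N, D)
        n = x.size(1)
        h = self.norm(x).transpose(1, 2)        # (B, D, N)
        f = self.fwd(h)[..., :n]                # causal left-to-right pass
        b = self.bwd(h.flip(-1))[..., :n].flip(-1)  # right-to-left pass
        return x + self.proj((f + b).transpose(1, 2))

class TrambaSketch(nn.Module):
    """DETR-style detector with the transformer encoder swapped out."""
    def __init__(self, d_model=256, n_enc=2, n_dec=2, n_queries=100, n_classes=3):
        super().__init__()
        self.backbone = nn.Conv2d(3, d_model, kernel_size=16, stride=16)  # patchify
        self.encoder = nn.Sequential(*[SimpleSSMBlock(d_model) for _ in range(n_enc)])
        self.queries = nn.Embedding(n_queries, d_model)   # DETR object queries
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_dec)
        self.cls_head = nn.Linear(d_model, n_classes + 1)  # +1 for "no object"
        self.box_head = nn.Linear(d_model, 4)              # (cx, cy, w, h)

    def forward(self, images):                  # images: (B, 3, H, W)
        feats = self.backbone(images).flatten(2).transpose(1, 2)  # (B, N, D)
        memory = self.encoder(feats)            # linear-time encoding of patches
        q = self.queries.weight.unsqueeze(0).expand(images.size(0), -1, -1)
        hs = self.decoder(q, memory)            # queries attend to encoded patches
        return self.cls_head(hs), self.box_head(hs).sigmoid()

logits, boxes = TrambaSketch()(torch.randn(1, 3, 256, 256))
print(logits.shape, boxes.shape)  # (1, 100, n_classes + 1) and (1, 100, 4)
```

The point of the sketch is the interface, not the internals: the encoder consumes a flattened patch sequence in O(N) time and hands the decoder a memory tensor of the same shape a transformer encoder would produce, which is why the swap composes cleanly with DETR's set-prediction queries and heads.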

Divisions: Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering
Item Type: Thesis (Masters)
Authors: Abid, Md Sayeed
Institution: Concordia University
Degree Name: M. Comp. Sc.
Program: Computer Science
Date: 20 July 2025
Thesis Supervisor(s): Wang, Yang and Suen, Ching Yee
ID Code: 995860
Deposited By: Md Sayeed Abid
Deposited On: 04 Nov 2025 15:34
Last Modified: 05 Nov 2025 01:00
All items in Spectrum are protected by copyright, with all rights reserved. The use of items is governed by Spectrum's terms of access.
