Page segmentation and identification for document image analysis

Waked, Boulos (2001) Page segmentation and identification for document image analysis. Masters thesis, Concordia University.

Abstract

The main objective of this thesis is to develop a system that automatically segments and labels a variety of real-life documents written in different languages. The main idea is to partition the whole document into subimages, assign each of them one of two labels, text or non-text (including graphics), and then identify the text as one of three script categories: Roman, Ideographic, or Arabic. The whole process consists of several steps. First, to detect the skew angle of the document, we use the Hough transform and the most frequently occurring local maximum. Next, to segment the page into regions, we have developed a novel approach based on diagonal scanning and node-edge orientation. Text and graphic components are then isolated using the geometric configuration of the connected components. The textual components are segmented into lines using the projection profile, and finally the script is classified into one of the three categories mentioned above using the bounding boxes and horizontal projection. The system has been tested on 215 samples of diverse document types from many sources, such as journal articles, magazines, newspapers, facsimiles, and office correspondence. These test samples include low-quality document images with different types of distortion, as well as upside-down, skewed, and low-resolution images. The system classifies the script type correctly for 93.5% of these documents and incorrectly for the remaining 6.5%.
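
The abstract gives no implementation details for the line segmentation step, so the following is only a minimal sketch of the general projection-profile technique it names, not the thesis's actual method. It assumes a deskewed binary image stored as a list of rows with 1 for ink and 0 for background; the function name `segment_lines`, the `min_ink` threshold, and the sample `page` are hypothetical, chosen for illustration.

```python
def segment_lines(image, min_ink=1):
    """Return (top, bottom) row index pairs for each text line.

    image   -- list of rows, each a list of 0 (background) or 1 (ink)
    min_ink -- assumed threshold: a row counts as text if it holds
               at least this many ink pixels
    """
    # Horizontal projection profile: number of ink pixels in each row.
    profile = [sum(row) for row in image]

    lines = []
    start = None  # first row of the text line currently being scanned
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                    # entering a text line
        elif ink < min_ink and start is not None:
            lines.append((start, y - 1))  # leaving a text line
            start = None
    if start is not None:                # line touching the bottom edge
        lines.append((start, len(profile) - 1))
    return lines


# Hypothetical 7-row page with two text lines separated by blank rows.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0],
]
print(segment_lines(page))  # [(1, 2), (5, 5)]
```

Rows whose ink count falls below the threshold act as separators between lines. This is why skew correction comes earlier in the pipeline: on a skewed page the gaps between lines no longer show up as empty rows in the profile.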

Divisions: Concordia University > Faculty of Engineering and Computer Science > Computer Science and Software Engineering
Item Type: Thesis (Masters)
Authors: Waked, Boulos
Pagination: xii, 89 leaves : ill. ; 29 cm.
Institution: Concordia University
Degree Name: Theses (M.Comp.Sc.)
Program: Computer Science and Software Engineering
Date: 2001
Thesis Supervisor(s): Suen, Ching Y
ID Code: 1476
Deposited By: Concordia University Libraries
Deposited On: 27 Aug 2009 13:19
Last Modified: 08 Dec 2010 10:20
All items in Spectrum are protected by copyright, with all rights reserved. The use of items is governed by Spectrum's terms of access.
