Login | Register

End-Shape Analysis for Automatic Segmentation of Arabic Handwritten Texts

Title:

End-Shape Analysis for Automatic Segmentation of Arabic Handwritten Texts

Jamal, Amani (2015) End-Shape Analysis for Automatic Segmentation of Arabic Handwritten Texts. PhD thesis, Concordia University.

[img]
Preview
Text (application/pdf)
Jamal_PhD_F2015.pdf - Accepted Version
3MB

Abstract

Word segmentation is an important task for many methods that are related to document understanding especially word spotting and word recognition. Several approaches of word segmentation have been proposed for Latin-based languages while a few of them have been introduced for Arabic texts. The fact that Arabic writing is cursive by nature and unconstrained with no clear boundaries between the words makes the processing of Arabic handwritten text a more challenging problem.
In this thesis, the design and implementation of an End-Shape Letter (ESL) based segmentation system for Arabic handwritten text is presented. This incorporates four novel aspects: (i) removal of secondary components, (ii) baseline estimation, (iii) ESL recognition, and (iv) the creation of a new off-line CENPARMI ESL database.
Arabic texts include small connected components, also called secondary components. Removing these components can improve the performance of several systems such as baseline estimation. Thus, a robust method to remove secondary components that takes into consideration the challenges in the Arabic handwriting is introduced. The methods reconstruct the image based on some criteria. The results of this method were subsequently compared with those of two other methods that used the same database. The results show that the proposed method is effective.
Baseline estimation is a challenging task for Arabic texts since it includes ligature, overlapping, and secondary components. Therefore, we propose a learning-based approach that addresses these challenges. Our method analyzes the image and extracts baseline dependent features. Then, the baseline is estimated using a classifier.
Algorithms dealing with text segmentation usually analyze the gaps between connected components. These algorithms are based on metric calculation, finding threshold, and/or gap classification. We use two well-known metrics: bounding box and convex hull to test metric-based method on Arabic handwritten texts, and to include this technique in our approach. To determine the threshold, an unsupervised learning approach, known as the Gaussian Mixture Model, is used. Our ESL-based segmentation approach extracts the final letter of a word using rule-based technique and recognizes these letters using the implemented ESL classifier.
To demonstrate the benefit of text segmentation, a holistic word spotting system is implemented. For this system, a word recognition system is implemented. A series of experiments with different sets of features are conducted. The system shows promising results.

Divisions:Concordia University > Gina Cody School of Engineering and Computer Science
Item Type:Thesis (PhD)
Authors:Jamal, Amani
Institution:Concordia University
Degree Name:Ph. D.
Program:Computer Science
Date:30 July 2015
Thesis Supervisor(s):Suen, Ching
ID Code:980356
Deposited By: AMANI JAMAL
Deposited On:27 Oct 2015 19:39
Last Modified:18 Jan 2018 17:51
All items in Spectrum are protected by copyright, with all rights reserved. The use of items is governed by Spectrum's terms of access.

Repository Staff Only: item control page

Downloads per month over past year

Back to top Back to top