End-Shape Analysis for Automatic Segmentation of Arabic Handwritten Texts

Jamal, Amani (2015) End-Shape Analysis for Automatic Segmentation of Arabic Handwritten Texts. PhD thesis, Concordia University.

Abstract

Word segmentation is an important task for many document understanding methods, especially word spotting and word recognition. Several word segmentation approaches have been proposed for Latin-based languages, while only a few have been introduced for Arabic texts. Because Arabic writing is cursive by nature and unconstrained, with no clear boundaries between words, processing Arabic handwritten text is a more challenging problem.
In this thesis, the design and implementation of an End-Shape Letter (ESL) based segmentation system for Arabic handwritten text are presented. The work incorporates four novel aspects: (i) removal of secondary components, (ii) baseline estimation, (iii) ESL recognition, and (iv) the creation of a new off-line CENPARMI ESL database.
Arabic texts include small connected components, also called secondary components. Removing these components can improve the performance of several downstream tasks, such as baseline estimation. Thus, a robust method for removing secondary components that takes into account the challenges of Arabic handwriting is introduced. The method reconstructs the image based on a set of criteria. Its results were compared with those of two other methods evaluated on the same database, and the comparison shows that the proposed method is effective.
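As a rough illustration of the idea of rebuilding an image without its secondary components, the sketch below labels connected components in a binary image and keeps only those whose area is large relative to the largest component. The area-ratio criterion, the function names, and the `min_area_ratio` parameter are assumptions made for this example; the abstract does not state which criteria the thesis actually uses.

```python
from collections import deque


def connected_components(image):
    """Return the 8-connected foreground components as lists of (row, col)."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        inside = 0 <= ny < rows and 0 <= nx < cols
                        if inside and image[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def remove_small_components(image, min_area_ratio=0.2):
    """Rebuild the image from components that pass a hypothetical area test.

    A component is kept when its pixel count is at least min_area_ratio
    times the pixel count of the largest component (an assumed criterion).
    """
    components = connected_components(image)
    output = [[0] * len(image[0]) for _ in image]
    if not components:
        return output
    largest = max(len(pixels) for pixels in components)
    for pixels in components:
        if len(pixels) >= min_area_ratio * largest:
            for y, x in pixels:
                output[y][x] = 1
    return output


# A thick stroke (rows 3-4) with an isolated dot above it (row 1).
image = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1],
]
cleaned = remove_small_components(image)
```

In this toy input the dot is dropped and the stroke is preserved. Real handwriting would need criteria that also account for position and shape, since size alone cannot separate diacritics from small main-body strokes.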
Baseline estimation is a challenging task for Arabic texts because they include ligatures, overlapping, and secondary components. Therefore, we propose a learning-based approach that addresses these challenges. Our method analyzes the image and extracts baseline-dependent features. The baseline is then estimated using a classifier.
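The abstract does not list the baseline-dependent features. One quantity commonly used in baseline work is the horizontal projection profile, the count of ink pixels per image row, whose peak is a classic heuristic estimate of the baseline. The sketch below shows that heuristic only as an example of a baseline-related feature; it is not the learning-based method of the thesis, and the function names are hypothetical.

```python
def horizontal_projection(image):
    """Count foreground (1) pixels in each row of a binary image."""
    return [sum(row) for row in image]


def peak_row(image):
    """Index of the row with the most ink; ties resolve to the topmost row."""
    profile = horizontal_projection(image)
    return max(range(len(profile)), key=profile.__getitem__)


# Toy word image: most ink lies on row 2.
word = [
    [0, 0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 0],
]
profile = horizontal_projection(word)
estimate = peak_row(word)
```

A learning-based estimator could take such a profile, or statistics derived from it, as part of its input; ligatures, overlapping, and secondary components are what make the raw peak unreliable on its own.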
Algorithms for text segmentation usually analyze the gaps between connected components. These algorithms are based on metric calculation, threshold selection, and/or gap classification. We use two well-known metrics, bounding box and convex hull, both to test the metric-based method on Arabic handwritten texts and to include this technique in our approach. To determine the threshold, an unsupervised learning approach, the Gaussian Mixture Model, is used. Our ESL-based segmentation approach extracts the final letter of a word using a rule-based technique and recognizes these letters using the implemented ESL classifier.
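The metric-and-threshold stage described above can be sketched as follows: compute bounding-box gaps between consecutive components, fit a two-component one-dimensional Gaussian mixture to the gaps with EM, and label a gap as a word boundary when the wider-gap component is the more likely one. This is a minimal sketch under stated assumptions: boxes are reduced to hypothetical `(x_min, x_max)` pairs, the EM loop is hand-written for self-containment, and the initialisation, iteration count, and variance floor are illustrative choices. The convex hull metric and the ESL stage are not shown.

```python
import math


def bounding_box_gaps(boxes):
    """Horizontal gaps between consecutive (x_min, x_max) boxes.

    Boxes are sorted by x_min; overlapping boxes yield a gap of 0.
    """
    ordered = sorted(boxes)
    gaps = []
    right_edge = ordered[0][1]
    for x_min, x_max in ordered[1:]:
        gaps.append(max(0, x_min - right_edge))
        right_edge = max(right_edge, x_max)
    return gaps


def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_two_gaussians(values, iterations=50, min_var=1e-3):
    """Fit a 1-D two-component GMM with EM; returns (weights, means, variances)."""
    lo, hi = min(values), max(values)
    means = [lo, hi]  # start one component at each extreme
    spread = max((hi - lo) ** 2 / 4, min_var)
    variances = [spread, spread]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each value.
        resp = []
        for v in values:
            p = [weights[k] * gaussian_pdf(v, means[k], variances[k]) for k in range(2)]
            total = sum(p) or 1e-300
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weight, mean, and variance per component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            weights[k] = nk / len(values)
            means[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            var = sum(r[k] * (v - means[k]) ** 2 for r, v in zip(resp, values)) / nk
            variances[k] = max(var, min_var)
    return weights, means, variances


def classify_gaps(gaps):
    """True where a gap is more likely drawn from the wider-gap component."""
    weights, means, variances = fit_two_gaussians(gaps)
    wide = 0 if means[0] > means[1] else 1
    labels = []
    for g in gaps:
        p = [weights[k] * gaussian_pdf(g, means[k], variances[k]) for k in range(2)]
        labels.append(p[wide] > p[1 - wide])
    return labels


example_gaps = bounding_box_gaps([(0, 10), (12, 20), (18, 25), (40, 55)])
is_word_boundary = classify_gaps([2, 3, 2, 20, 3, 22, 2, 21])
```

The implicit threshold is the gap value at which the two weighted densities cross, so it adapts to each writer's spacing instead of being fixed in advance. In practice a library mixture model would normally replace the hand-written EM loop.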
To demonstrate the benefit of text segmentation, a holistic word spotting system is implemented, for which a word recognition system is also implemented. A series of experiments with different feature sets is conducted, and the system shows promising results.

Divisions:	Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering
Item Type:	Thesis (PhD)
Authors:	Jamal, Amani
Institution:	Concordia University
Degree Name:	Ph. D.
Program:	Computer Science
Date:	30 July 2015
Thesis Supervisor(s):	Suen, Ching
ID Code:	980356
Deposited By:	AMANI JAMAL
Deposited On:	27 Oct 2015 19:39
Last Modified:	18 Jul 2019 15:30
