Automatic classification of multi-lingual documents

Ding, Jie (1999) Automatic classification of multi-lingual documents. Masters thesis, Concordia University.

Abstract

Language classification (LC) refers to the categorization of text documents into different natural language groups, whereas language identification (LI) determines the language used in a document. LC and LI play important roles in document processing systems because they can perform initial classifications that reduce the scope of subsequent processing stages. The work has two major parts: (1) LC of documents written in 24 languages into two language categories (oriental and European), and (2) LI of oriental documents as Chinese, Japanese or Korean. This thesis concentrates on the exploration of statistical features that can contribute to LC/LI, as well as the design and implementation of programs to differentiate between documents printed in various natural languages. A total of six distinctive features are proposed and used in this study. For LC, three features are used: horizontal projection profiles, height distributions of connected components (CCs) and the enclosing structure of connected components. Experimental results show that the script of a document can be classified as either European or Asian based on four 50-CCs, with a high recognition rate and a low rejection rate. For the LI of oriental documents, the complexity of structure, Korean "circles" and vertical strokes were chosen as the distinguishing features among the three language scripts. The identification is made according to the range of values of these features, and also by K-means clustering. When applied to seven hundred documents in the CENPARMI Lab, the recognition rates achieved in LC and LI exceeded 95% and 94%, with error rates below 2% and 4.5%, respectively.
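As a rough illustration of two of the LC features named in the abstract, the sketch below computes a horizontal projection profile and the heights of connected components for a small binary image. It is not the thesis's implementation: the function names `horizontal_projection` and `component_heights`, the list-of-lists image representation, and the use of 4-connectivity are all assumptions made for this example.

```python
from collections import deque


def horizontal_projection(image):
    """Count foreground (1) pixels in each row of a binary image."""
    return [sum(row) for row in image]


def component_heights(image):
    """Return the bounding-box height of each 4-connected foreground component.

    Assumption: pixels are 0 (background) or 1 (foreground), and components
    are reported in row-major order of their first pixel.
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    heights = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill, tracking the top and bottom rows reached.
            seen[r][c] = True
            queue = deque([(r, c)])
            top = bottom = r
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and image[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            heights.append(bottom - top + 1)
    return heights


# Toy binary image containing three separate components.
toy_image = [
    [1, 0, 0, 1, 1],
    [1, 0, 0, 1, 1],
    [1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
]
profile = horizontal_projection(toy_image)
heights = component_heights(toy_image)
```

On a real scanned page, a histogram of such component heights could then serve as one input to a classifier; how the thesis actually combines the features is not specified in the abstract.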

Divisions: Concordia University > Faculty of Engineering and Computer Science > Computer Science and Software Engineering
Item Type: Thesis (Masters)
Authors: Ding, Jie
Pagination: ix, 79 leaves : ill. ; 29 cm.
Institution: Concordia University
Degree Name: Theses (M.Comp.Sc.)
Program: Computer Science and Software Engineering
Date: 1999
Thesis Supervisor(s): Suen, Ching Y
ID Code: 736
Deposited By: Concordia University Libraries
Deposited On: 27 Aug 2009 13:13
Last Modified: 08 Dec 2010 10:16
All items in Spectrum are protected by copyright, with all rights reserved. The use of items is governed by Spectrum's terms of access.
