Automatic classification of multi-lingual documents


Ding, Jie (1999) Automatic classification of multi-lingual documents. Masters thesis, Concordia University.



Language classification (LC) refers to the categorization of text documents into different natural language groups, whereas language identification (LI) determines the language used in a document. LC and LI play important roles in document processing systems because they perform an initial classification that narrows the scope of subsequent processing stages. The work has two major parts: (1) LC of documents written in 24 languages into two language categories (oriental and European), and (2) LI of oriental documents as Chinese, Japanese or Korean. This thesis concentrates on exploring statistical features that can contribute to LC/LI, and on designing and implementing programs to differentiate between documents printed in various natural languages. A total of six distinctive features are proposed and used in this study. For LC, three features are used: horizontal projection profiles, height distributions of connected components (CCs), and the enclosing structure of connected components. Experimental results show that the script of a document can be classified as either European or Asian based on four 50-CCs, with a high recognition rate and a low rejection rate. For the LI of oriental documents, the complexity of structure, Korean "circles" and vertical strokes were chosen as distinguishing features among the three scripts. Identification is made according to the range of values of these features, and also by K-means clustering. When applied to seven hundred documents in the CENPARMI Lab, the recognition rates achieved in LC and LI exceeded 95% and 94%, with error rates below 2% and 4.5%, respectively.
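Two of the LC features named in the abstract, horizontal projection profiles and the heights of connected components, can be illustrated with a minimal sketch. This is not the thesis's implementation: the binary-image representation (a list of rows with 1 for ink pixels), the use of 4-connectivity, and the function names `horizontal_projection` and `component_heights` are all assumptions made for illustration.

```python
from collections import deque


def horizontal_projection(image):
    """Count foreground (1) pixels in each row of a binary image."""
    return [sum(row) for row in image]


def component_heights(image):
    """Return the bounding-box height of each 4-connected foreground component.

    Components are reported in raster-scan order of their first pixel.
    4-connectivity is an assumption; the thesis may define CCs differently.
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    heights = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel,
            # tracking the topmost and bottommost rows reached.
            top = bottom = r
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and image[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            heights.append(bottom - top + 1)
    return heights


# Hypothetical toy image: three small components of different heights.
toy = [
    [1, 0, 0, 1, 0],
    [1, 0, 0, 1, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
]
profile = horizontal_projection(toy)
heights = component_heights(toy)
```

On a scanned page, a histogram of such heights and the shape of the row profile are the kind of statistics a classifier could threshold or cluster. How the thesis derives its decision rules from them is not stated in the abstract.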

Divisions: Concordia University > Faculty of Engineering and Computer Science > Computer Science and Software Engineering
Item Type: Thesis (Masters)
Authors: Ding, Jie
Pagination: ix, 79 leaves : ill. ; 29 cm.
Institution: Concordia University
Degree Name: Theses (M.Comp.Sc.)
Program: Computer Science and Software Engineering
Thesis Supervisor(s): Suen, Ching Y
ID Code: 736
Deposited By: Concordia University Libraries
Deposited On: 27 Aug 2009 17:13
Last Modified: 08 Dec 2010 15:16
All items in Spectrum are protected by copyright, with all rights reserved. The use of items is governed by Spectrum's terms of access.
