Ding, Jie (1999) Automatic classification of multi-lingual documents. Masters thesis, Concordia University.
Abstract
Language classification (LC) refers to the categorization of text documents into different natural language groups, whereas language identification (LI) determines the language used in a document. LC and LI play important roles in document processing systems because they can perform initial classifications that reduce the scope of subsequent processing stages. The work has two major parts: (1) LC of documents written in 24 languages into two language categories (oriental and European), and (2) LI of oriental documents into Chinese, Japanese and Korean. This thesis concentrates on the exploration of statistical features that can contribute to LC/LI, as well as the design and implementation of programs to differentiate between documents printed in various natural languages. A total of six distinctive features are proposed and used in this study. For LC, three features are used: horizontal projection profiles, height distributions of connected components (CC) and the enclosing structure of connected components. Experimental results show that the script of a document can be classified as either European or Asian based on four 50-CCs, with a high recognition rate and a low rejection rate. For the LI of oriental documents, the complexity of structure, Korean "circles" and vertical strokes were chosen as features for distinguishing between the three language scripts. Identification is made according to the ranges of values of these features, and also by K-means clustering. When applied to seven hundred documents in the CENPARMI Lab, the recognition rates achieved in LC and LI exceeded 95% and 94%, with error rates below 2% and 4.5%, respectively.
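As a rough illustration of two of the LC features named in the abstract, the sketch below computes a horizontal projection profile and the heights of connected components on a tiny binary image. It is not the thesis implementation: the function names, the choice of 4-connectivity, the bounding-box definition of height and the toy image are all assumptions made for this example.

```python
from collections import deque


def horizontal_projection(image):
    """Count foreground (1) pixels in each row of a binary image."""
    return [sum(row) for row in image]


def component_heights(image):
    """Bounding-box height of each 4-connected foreground component.

    Components are reported in row-major order of their first pixel.
    4-connectivity is an assumption; the thesis may define CCs differently.
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    heights = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill over one component.
            top = bottom = r
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and image[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            heights.append(bottom - top + 1)
    return heights


# Hypothetical toy image: a 3-pixel vertical stroke and a 2x2 blob.
toy = [
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [1, 0, 0, 1, 1, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
profile = horizontal_projection(toy)   # [0, 3, 3, 1, 0]
heights = component_heights(toy)       # [3, 2]
```

On real scanned pages, statistics of such profiles and height distributions could serve as input to a classifier. The thresholds and decision rules the thesis actually uses are not given in the abstract and are not reproduced here.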
Divisions: | Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering |
---|---|
Item Type: | Thesis (Masters) |
Authors: | Ding, Jie |
Pagination: | ix, 79 leaves : ill. ; 29 cm. |
Institution: | Concordia University |
Degree Name: | M. Comp. Sc. |
Program: | Computer Science and Software Engineering |
Date: | 1999 |
Thesis Supervisor(s): | Suen, Ching Y |
Identification Number: | QA 76.9 N38D56 1999 |
ID Code: | 736 |
Deposited By: | Concordia University Library |
Deposited On: | 27 Aug 2009 17:13 |
Last Modified: | 13 Jul 2020 19:47 |