Ding, Jie (1999) Automatic classification of multi-lingual documents. Masters thesis, Concordia University.
Abstract
Language classification (LC) refers to the categorization of text documents into different natural language groups, whereas language identification (LI) determines the language used in a document. LC and LI play important roles in document processing systems because they can perform initial classifications that reduce the scope of subsequent processing stages. The work has two major parts: (1) LC of documents written in 24 languages into two language categories (oriental and European), and (2) LI of oriental documents as Chinese, Japanese or Korean. This thesis concentrates on the exploration of statistical features that can contribute to LC/LI, as well as the design and implementation of programs to differentiate between documents printed in various natural languages. A total of six distinctive features are proposed and used in this study. For LC, three features are used: horizontal projection profiles, height distributions of connected components (CCs) and enclosing structures of connected components. Experimental results show that we are able to classify the script of a document as either European or Asian based on four 50-CCs and obtain a high recognition rate while maintaining a low rejection rate. For the LI of oriental documents, the complexity of structure, Korean "circles" and vertical strokes have been chosen as distinguishing features among the three language scripts. The identification has been made according to the range of values of these features, and also by K-means clustering. When applied to seven hundred documents in the CENPARMI Lab, the recognition rates achieved in LC and LI have exceeded 95% and 94%, respectively, with error rates below 2% and 4.5%.
| Divisions: | Concordia University > Faculty of Engineering and Computer Science > Computer Science and Software Engineering |
|---|---|
| Item Type: | Thesis (Masters) |
| Authors: | Ding, Jie |
| Pagination: | ix, 79 leaves : ill. ; 29 cm. |
| Institution: | Concordia University |
| Degree Name: | Theses (M.Comp.Sc.) |
| Program: | Computer Science and Software Engineering |
| Date: | 1999 |
| Thesis Supervisor(s): | Suen, Ching Y |
| ID Code: | 736 |
| Deposited By: | Concordia University Libraries |
| Deposited On: | 27 Aug 2009 13:13 |
| Last Modified: | 08 Dec 2010 10:16 |

