Automatic semantic header generator for PDF documents



Xue, Furong (2003) Automatic semantic header generator for PDF documents. Masters thesis, Concordia University.

Text (application/pdf)
MQ91140.pdf - Accepted Version


The Concordia INdexing and DIscovery system (CINDI) is an information discovery and retrieval system that enables a reader to discover resources from a bibliographic database. It uses a metadata description called a semantic header to describe an information resource; its contents include the title, author name, subject, sub-subject, etc. The Automatic Semantic Header Generator (ASHG) generates a draft version of the semantic header from a resource automatically. The existing system handles four document formats: HTML, TEXT, LATEX, and RTF. Because PDF is easy to use and portable across platforms, more and more people use it for document exchange and for reading online or in print, and more documents are published in PDF format. This thesis presents the design and implementation of an extension to the existing ASHG that extracts the semantic header from a PDF document automatically. First, the PDF document is converted to a plain text file using Xpdf, an open source software package. Xpdf has been modified to improve the conversion results. To test the accuracy of the ASHG, 500 articles, all from the computer science field, are used in an experiment to generate semantic headers; the results are 80% accurate. However, the results reveal that subject classification (about 41%) is the weakest point of the ASHG and requires further work.
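The convert-then-extract pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's ASHG implementation: it assumes an unmodified `pdftotext` (the Xpdf command-line converter) is available on the PATH, and the function names `pdf_to_text` and `draft_semantic_header`, as well as the "first line is the title, second line is the author" rule, are hypothetical stand-ins for whatever extraction rules ASHG actually applies.

```python
import subprocess


def pdf_to_text(pdf_path: str) -> str:
    """Convert a PDF to plain text with the pdftotext command-line tool.

    Assumption: pdftotext is installed and on the PATH. Passing "-" as
    the output file makes it write the text to standard output.
    """
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", "-layout", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def draft_semantic_header(text: str) -> dict:
    """Build a draft header from plain text.

    Hypothetical heuristic for illustration only: the first non-empty
    line is taken as the title and the second as the author name.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return {
        "title": lines[0] if lines else "",
        "author": lines[1] if len(lines) > 1 else "",
    }
```

A call such as `draft_semantic_header(pdf_to_text("paper.pdf"))` would chain the two steps; the file name is a placeholder. A real generator would also need the subject and sub-subject fields, which the abstract identifies as the hardest part.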

Divisions: Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering
Item Type: Thesis (Masters)
Authors: Xue, Furong
Pagination: viii, 114 leaves : ill. ; 29 cm.
Institution: Concordia University
Degree Name: M. Comp. Sc.
Program: Dept. of Computer Science
Thesis Supervisor(s): Desai, Bipin C
Identification Number: QA 76.9 T48X84 2003
ID Code: 2364
Deposited By: Concordia University Library
Deposited On: 27 Aug 2009 17:27
Last Modified: 13 Jul 2020 19:52
All items in Spectrum are protected by copyright, with all rights reserved. The use of items is governed by Spectrum's terms of access.
