Automatic semantic header generator for PDF documents

Xue, Furong (2003) Automatic semantic header generator for PDF documents. Masters thesis, Concordia University.

PDF - Accepted Version
10Mb

Abstract

The Concordia INdexing and DIscovery system (CINDI) is an information discovery and retrieval system that enables a reader to discover resources from a bibliographic database. It describes each information resource with a metadata record called a semantic header, whose content includes the title, author name, subject, sub-subject, etc. The Automatic Semantic Header Generator (ASHG) generates a draft version of the semantic header from a resource automatically. The existing system handles four document formats: HTML, plain text, LaTeX, and RTF. Because PDF is easy to use and portable across platforms, more and more people use it for document exchange and for reading online or in print, so more documents are published in PDF format. This thesis presents the design and implementation of an extension to the existing ASHG that extracts the semantic header from a PDF document automatically. First, the PDF document is converted to a plain text file using Xpdf, an open-source software package. Xpdf was modified to improve the conversion results. To test the accuracy of the ASHG, 500 articles, all from the computer science field, were used in an experiment to generate semantic headers; the results were 80% accurate. However, the results also reveal that subject classification (about 41%) is the weakest point of the ASHG and requires further work.
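The conversion step described in the abstract can be illustrated with a minimal sketch. It assumes an unmodified `pdftotext` command-line tool (shipped with Xpdf) is available on the PATH; the thesis used a modified Xpdf, whose changes are not described here. The function names and the "first line is the title, second line is the author" heuristic are hypothetical placeholders for illustration only, not the ASHG's actual extraction rules.

```python
import subprocess


def pdftotext_command(pdf_path, first_page=1, last_page=1):
    """Build a pdftotext call for a page range (hypothetical helper).

    -f / -l select the first and last page, -layout keeps the physical
    layout, and "-" as the output file sends the text to stdout.
    """
    return ["pdftotext", "-f", str(first_page), "-l", str(last_page),
            "-layout", pdf_path, "-"]


def pdf_to_text(pdf_path):
    """Convert the first page of a PDF to plain text via pdftotext."""
    result = subprocess.run(pdftotext_command(pdf_path),
                            capture_output=True, text=True, check=True)
    return result.stdout


def draft_header(text):
    """Build a draft header from converted text.

    ASSUMPTION: a naive heuristic (first non-empty line = title, second
    non-empty line = author). The real ASHG rules are not given in the
    abstract; the subject field is left empty rather than guessed.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return {
        "title": lines[0] if lines else "",
        "author": lines[1] if len(lines) > 1 else "",
        "subject": "",
    }
```

A call such as `draft_header(pdf_to_text("paper.pdf"))` would chain the two steps, where `paper.pdf` stands for any local file. Keeping the conversion and the header extraction in separate functions mirrors the two stages named in the abstract and lets the extraction be exercised on plain text without a PDF toolchain.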

Divisions: Concordia University > Faculty of Engineering and Computer Science > Computer Science and Software Engineering
Item Type: Thesis (Masters)
Authors: Xue, Furong
Pagination: viii, 114 leaves : ill. ; 29 cm.
Institution: Concordia University
Degree Name: Theses (M.Comp.Sc.)
Program: Dept. of Computer Science
Date: 2003
Thesis Supervisor(s): Desai, Bipin C
ID Code: 2364
Deposited By: Concordia University Libraries
Deposited On: 27 Aug 2009 13:27
Last Modified: 14 Dec 2012 16:10
All items in Spectrum are protected by copyright, with all rights reserved. The use of items is governed by Spectrum's terms of access.