
An enhanced Web robot for the CINDI system

Chen, Rui (2008) An enhanced Web robot for the CINDI system. Masters thesis, Concordia University.

Text (application/pdf)
MR40937.pdf - Accepted Version
4MB

Abstract

With the explosion of the Web, traditional general-purpose web crawlers are no longer sufficient for many web traversal and mining applications. Consequently, focused web crawlers are gaining attention. A focused web crawler aims to find only those web pages related to a predefined topic, at much lower storage and computing cost, which makes it inherently suitable for the construction of digital libraries. As an essential part of the Concordia INdexing and DIscovering system (CINDI) digital library project, CINDI Robot is a focused web crawler that finds and collects online academic and scientific documents in the fields of computer science and software engineering. In this thesis, we discuss the details of building CINDI Robot, a multi-threaded, large-scale, intelligence-based focused web crawler. To enhance CINDI Robot, several state-of-the-art techniques are adopted or modified to suit our task. The naïve Bayes classifier and the Support Vector Machine classifier are used for classification; a revised context graph algorithm and a special tunneling strategy are employed to increase recall; and URL ordering policies are set up to sort the web pages to be crawled. Other heuristics obtained during the experimental stage are also incorporated into the final version of CINDI Robot. Finally, we form a multi-level inspection infrastructure to traverse the Web efficiently. Through this multi-level inspection scheme, text features of web page contents, URL patterns and anchor texts are considered together to guide the crawling process. Our experiments demonstrate that the final version of CINDI Robot outperforms traditional web crawlers in terms of precision, recall and crawling speed.
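
The abstract names three signals that guide crawling (page text, URL patterns and anchor text) and a URL ordering policy. The following is only a minimal sketch of how such signals could feed a priority-ordered crawl frontier; it is not the thesis implementation. The topic vocabulary, the URL pattern, the weights and the names `term_overlap`, `link_priority` and `Frontier` are all assumptions made for illustration, and a simple term-overlap score stands in for the trained naïve Bayes and SVM classifiers.

```python
import heapq
import re

# Hypothetical topic vocabulary; the thesis uses trained classifiers instead.
TOPIC_TERMS = {"algorithm", "software", "computing", "compiler", "database"}
# Hypothetical URL pattern hinting at academic documents.
DOC_URL = re.compile(r"\.(pdf|ps)$|/(papers|publications)/", re.IGNORECASE)


def term_overlap(text):
    """Fraction of topic terms that occur in the text (0.0 to 1.0)."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return len(words & TOPIC_TERMS) / len(TOPIC_TERMS)


def link_priority(page_text, url, anchor_text):
    """Combine page text, URL pattern and anchor text with assumed weights."""
    url_hit = 1.0 if DOC_URL.search(url) else 0.0
    return (0.5 * term_overlap(page_text)
            + 0.3 * url_hit
            + 0.2 * term_overlap(anchor_text))


class Frontier:
    """Priority queue of URLs; the highest-priority URL is popped first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._count = 0  # tie-breaker: equal scores pop in insertion order

    def push(self, url, priority):
        if url in self._seen:  # never queue the same URL twice
            return
        self._seen.add(url)
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-priority, self._count, url))
        self._count += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


if __name__ == "__main__":
    page = "A survey of database and compiler research in computing"
    links = [
        ("http://example.org/news/sports.html", "sports news"),
        ("http://example.org/papers/index.pdf", "algorithm papers"),
    ]
    frontier = Frontier()
    for url, anchor in links:
        frontier.push(url, link_priority(page, url, anchor))
    print(frontier.pop())
```

In this sketch a link found on an on-topic page still receives a nonzero priority even when its own URL and anchor text look irrelevant, which loosely mirrors the idea of following weakly promising links rather than discarding them outright.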

Divisions: Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering
Item Type: Thesis (Masters)
Authors: Chen, Rui
Pagination: x, 119 leaves : ill. ; 29 cm.
Institution: Concordia University
Degree Name: M. Comp. Sc.
Program: Computer Science and Software Engineering
Date: 2008
Thesis Supervisor(s): Desai, Bipin
Identification Number: LE 3 C66C67M 2008 C484
ID Code: 975876
Deposited By: Concordia University Library
Deposited On: 22 Jan 2013 16:16
Last Modified: 13 Jul 2020 20:08
All items in Spectrum are protected by copyright, with all rights reserved. The use of items is governed by Spectrum's terms of access.
