Chen, Rui (2008) An enhanced Web robot for the CINDI system. Masters thesis, Concordia University.
Text (application/pdf)
MR40937.pdf - Accepted Version (4MB)
Abstract
With the explosion of the Web, traditional general-purpose web crawlers are no longer sufficient for many web traversal and mining applications. Consequently, focused web crawlers are gaining attention. Focused web crawlers aim to find only those web pages related to a predefined topic, at much lower storage and computing cost. They are inherently suitable for the construction of digital libraries. As an essential part of the Concordia INdexing and DIscovering system (CINDI) digital library project, CINDI Robot is a focused web crawler that discovers and collects online academic and scientific documents in the fields of computer science and software engineering. In this thesis, we discuss the details of building CINDI Robot, a multi-threaded, large-scale, intelligence-based focused web crawler. To enhance CINDI Robot, some state-of-the-art techniques are exploited or modified to suit our task. The naïve Bayes classifier and the Support Vector Machine classifier are used for classification; a revised context graph algorithm and a special tunneling strategy are employed to increase recall; URL ordering policies are set up to sort all web pages to be crawled. Other heuristics obtained during the experimental stage are also incorporated into the final version of CINDI Robot. Finally, we form a multi-level inspection infrastructure to traverse the Web efficiently. Through this multi-level inspection scheme, text features of web page contents, URL patterns and anchor texts are considered together to guide the crawling process. Our experiments demonstrate that the final version of CINDI Robot outperforms traditional web crawlers in terms of precision, recall and crawling speed.
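As an illustration only, the idea of ordering URLs by jointly considering page content, URL patterns and anchor text can be sketched as a priority-queue crawl frontier. This is a minimal sketch and not the implementation described in the thesis: the weights, the `combined_score` function and the `Frontier` class are hypothetical, and the abstract does not state how the three signals are actually combined.

```python
import heapq

# Hypothetical weights: the abstract does not say how the signals are combined.
WEIGHTS = {"content": 0.5, "url": 0.2, "anchor": 0.3}


def combined_score(content, url, anchor):
    """Weighted sum of three relevance signals, each assumed to lie in [0, 1]."""
    return (WEIGHTS["content"] * content
            + WEIGHTS["url"] * url
            + WEIGHTS["anchor"] * anchor)


class Frontier:
    """Crawl frontier that returns the highest-scoring unseen URL first."""

    def __init__(self):
        self._heap = []      # entries are (-score, insertion order, url)
        self._seen = set()   # URLs already queued, to avoid duplicates
        self._counter = 0    # tie-breaker that keeps insertion order stable

    def push(self, url, score):
        """Queue a URL; return False if it was already queued."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the maximum.
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1
        return True

    def pop(self):
        """Remove and return (url, score) for the best queued URL."""
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)


if __name__ == "__main__":
    frontier = Frontier()
    # Example URLs and signal values are made up for the demonstration.
    frontier.push("http://example.org/papers/svm.pdf", combined_score(0.9, 0.8, 0.7))
    frontier.push("http://example.org/shop/cart", combined_score(0.1, 0.2, 0.1))
    frontier.push("http://example.org/people/index.html", combined_score(0.5, 0.5, 0.5))
    while len(frontier):
        print(frontier.pop())
```

In a real focused crawler the three signal values would come from trained classifiers and pattern rules rather than hand-written constants; the sketch only shows how a combined score can drive the crawl order.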
Divisions: | Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering |
---|---|
Item Type: | Thesis (Masters) |
Authors: | Chen, Rui |
Pagination: | x, 119 leaves : ill. ; 29 cm. |
Institution: | Concordia University |
Degree Name: | M. Comp. Sc. |
Program: | Computer Science and Software Engineering |
Date: | 2008 |
Thesis Supervisor(s): | Desai, Bipin |
Identification Number: | LE 3 C66C67M 2008 C484 |
ID Code: | 975876 |
Deposited By: | Concordia University Library |
Deposited On: | 22 Jan 2013 16:16 |
Last Modified: | 13 Jul 2020 20:08 |