Test: Internet Indexing Systems vs List of Known URLs: Revisited

(c) Bipin C. Desai
Department of Computer Science
1455 De Maisonneuve Blvd. West
Canada, H3G 1M8
Email: bcdesai@cs.concordia.ca
October 1997

Table 1 below presents the results obtained when a search was made in June 1995 using the then-pioneering index engines. The complete details are given in:

Search System     Number of Hits   Number of Duplicates   Number of Mis-hits   Number of Items Missed
Aliweb            none             -                      -                    25
DACLOD            none             -                      -                    25
EINet             6                0                      4                    23
GNA Meta Lib.     none             -                      -                    25
Harvest           none             -                      -                    25
InfoSeek          7                0                      0                    18
Lycos             231              2                      222                  18
Nikos             none             -                      -                    25
RBSE              8                -                      8                    25
W3 Catalog        none             -                      -                    25
WebCrawler        7                3                      0                    21
WWWW              2                0                      0                    23
Yahoo             none             -                      -                    25

Table 1: Search statistics for using the search term
Bipin (AND) Desai:
June, 1995

Many of the pioneering indexing systems that existed in mid-1995 are no longer accessible. In the meantime, a number of new systems, such as AltaVista, OpenText and Hotbots, have emerged. Many workers in the domain of digital virtual libraries feel that these newer systems have addressed many of the issues we raised in the Cindi System, which is still under active development at Concordia.[1]

The current series of tests was done in September through October 1997 to find the number of relevant documents that these current search engines could locate and to evaluate the usefulness of the index entries so retrieved. The relevance of a document can be judged easily once the target set is known. We repeated the test performed in 1995 with the same search words. At the time of the test, some 325 URLs were known to contain the words "Bipin" and "Desai". These represent Web documents pertaining to the author. The complete list of these URLs is given here.

The first test, given in Table 2 below, was done on the following search engines: AltaVista, Excite, Infoseek, Lycos, Hotbots, OpenText and Yahoo.

For Web search, Yahoo appears to use the AltaVista engine and its database, and produces almost identical results; hence we give a single result for both search systems in Table 2.

As in the 1995 series of tests, we report the number of hits produced, the number of duplicates, the number of mis-hits and the number of relevant documents not listed in the result; we have also included a column for the number of URLs which are no longer valid. The duplicates are either the same document served from two sites or the same document listed twice. The latter error seems to have been corrected in most search engines, which have eliminated such obvious duplicates.
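
The bookkeeping behind these columns can be sketched in a few lines of Python. This is only an illustration, not the procedure used for the tests reported here. It assumes that results and known documents are compared by exact URL string, so it detects a document listed twice but not the same document served from two sites. The function name `tally_results`, the `is_live` callback and the example URLs are all hypothetical.

```python
from typing import Callable, Iterable


def tally_results(
    results: Iterable[str],
    known_urls: set[str],
    is_live: Callable[[str], bool] = lambda url: True,
) -> dict[str, int]:
    """Count hits, duplicates, mis-hits, defunct URLs and missed items."""
    seen: set[str] = set()    # every distinct URL returned so far
    found: set[str] = set()   # live, relevant URLs from the known list
    counts = {"hits": 0, "duplicates": 0, "mis_hits": 0, "defunct": 0}
    for url in results:
        counts["hits"] += 1
        if url in seen:                 # same URL listed twice
            counts["duplicates"] += 1
            continue
        seen.add(url)
        if not is_live(url):            # URL no longer valid
            counts["defunct"] += 1
        elif url in known_urls:         # relevant document located
            found.add(url)
        else:                           # returned but not relevant
            counts["mis_hits"] += 1
    counts["missed"] = len(known_urls - found)
    return counts


# Hypothetical data: three known documents, four results returned.
known = {"http://example.org/a", "http://example.org/b", "http://example.org/c"}
results = [
    "http://example.org/a",
    "http://example.org/a",   # listed twice
    "http://example.org/x",   # not in the known list
    "http://example.org/b",   # treated as dead by is_live below
]
summary = tally_results(
    results, known, is_live=lambda u: u != "http://example.org/b"
)
print(summary)
```

In this sketch a defunct URL does not count as a located document, so it also adds to the number of items missed; whether the tables were tallied that way is an assumption.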

The documents missed could be due to the approximations used by engines such as AltaVista when they find a large number of hits. However, the fact that these search engines could not locate all the documents indicates the inherent problem of isolated URLs.

The bigger problem is the lack of selectivity and of a measure of usefulness of the documents found by the search engines. We have collated the results by following the trail of "next" sets of URLs, and these can be viewed by pressing on the number of hits for each search engine in Table 2. A glance at the abstracts or summaries presented by the search engines indicates that they are not very revealing and that, except for the most pedestrian needs, following the pointers would result in a drain on the searcher's time.

Search System     Number of Hits   Number of Duplicates   Number of Mis-hits   Number of Defunct   Number of Items Missed
AltaVista/Yahoo   97               9                      23                   4                   264
Excite            114              10                     29                   7                   247
Infoseek          8                2                      1                    1                   319
Lycos             57               7                      15                   14                  297
Hotbots           247              28                     58                   19                  155
OpenText          19               -                      7                    5                   318

Table 2: Search statistics for using the search term
Bipin (AND) Desai
Total known URLs: 325
Sept-Oct 1997
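
To make the last column easier to compare across engines, the share of the 325 known URLs that each engine did locate can be derived from the table. The sketch below is an illustration only: the term "coverage" and the function name are introduced here, not taken from the original, and the Hotbots row is read as 19 defunct URLs and 155 items missed.

```python
TOTAL_KNOWN = 325  # total known URLs, from the Table 2 caption

# (number of hits, number of items missed) per engine, copied from Table 2.
# Assumption: the Hotbots row is read as 19 defunct and 155 missed.
table2 = {
    "AltaVista/Yahoo": (97, 264),
    "Excite": (114, 247),
    "Infoseek": (8, 319),
    "Lycos": (57, 297),
    "Hotbots": (247, 155),
    "OpenText": (19, 318),
}


def coverage(missed: int, total: int = TOTAL_KNOWN) -> float:
    """Fraction of the known URLs that were not missed."""
    return (total - missed) / total


for engine, (hits, missed) in table2.items():
    print(f"{engine:16s} hits={hits:3d} coverage={coverage(missed):6.1%}")
```

Under this reading, Hotbots locates roughly half of the known URLs, while every other engine stays below a quarter.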

Search statistics for using various search strategies

In this series of further tests, we used a simple search with the search terms Bipin Desai, and the search expressions "Bipin Desai" and "Bipin C. Desai" respectively. These tests were made only on AltaVista/Yahoo. The results of these tests are given in Table 3.

The simple search shows a high number of hits (4285 in the test reported here; there is a bit of variation due to AltaVista's method of abandoning a search after a sufficiently large number of hits is made). However, the simple search produces very low selectivity and relevance. Most of the hits in the top 160 entries are irrelevant, and a large number of relevant documents are not located. Most searchers will not have the patience to go through more than a few pages of the result: there are some 214 pages of results for 4285 hits!
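
The page count quoted above is consistent with about 20 results per page. That page size is an assumption inferred from 4285 / 214; it is not stated in the text. A short check of the arithmetic under that assumption:

```python
import math

hits = 4285
per_page = 20  # assumption: results per page, inferred rather than stated

full_pages = hits // per_page            # pages completely filled
pages = math.ceil(hits / per_page)       # including the final partial page
sample_pages = math.ceil(160 / per_page) # pages covered by the top 160 entries

print(full_pages, pages, sample_pages)
```

With these numbers, the 160 entries examined would correspond to only the first 8 of more than 200 result pages.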

The search expression "Bipin Desai" gives a relatively low number of hits and low relevance, since the author prefers to include his middle initial in his name. Most searchers may not be aware of such details.

The search expression "Bipin C. Desai" gives a relatively large number of relevant documents, some of which are duplicates, being accessible from more than one site. Some of the defunct URLs have not been deleted by the search engines, which points to the maintenance problem of the underlying database. However, this search still missed about two thirds of the documents.

As such, we feel that the Semantic Header based system, where the provider of the resource is responsible for generating the entry, would be a more useful scheme to support discovery. In the meantime, one uses the tools at hand!

Search System             Number of Hits   Number of Duplicates   Number of Mis-hits   Number of Defunct   Number of Items Missed
AltaVista/Yahoo [1]       4285             30-40%                 40-50%               5%                  250+
AltaVista/Yahoo -1 [2]    29               2                      13                   3                   312
AltaVista/Yahoo -2 [3]    128              14                     -                    10                  221

Table 3: Search statistics for using various search strategies
Total known URLs: 325
Sept-Oct 1997

[1] Simple Search: Bipin Desai: estimated from the first 160 items in the result
[2] Advanced Search Expression "Bipin Desai"
[3] Advanced Search Expression "Bipin C. Desai"

[1] Bipin C. Desai, "Supporting Discovery in Virtual Libraries", Journal of the American Society for Information Science (JASIS), Vol. 48, No. 3, pp. 190-204, March 1997.
