End User's Corner - September 1996

Searching the Internet Part I

Some Basic Considerations and Automated Search Indexes

Jack Solock, Special Librarian

September 1996

Whenever librarians or information professionals have to think about searching the Web using search indexes like Alta Vista, HotBot, WebCrawler, or Lycos, they must cringe just a little bit. They are accustomed to the extraordinarily powerful proprietary indexes that drive such services as Dialog, Lexis/Nexis, or H.W. Wilson's bank of bibliographic search databases. These engines allow the user to search effectively through terabytes of data.

Such search capabilities are necessary in order to retrieve just the requested items, but the learning curve for these tools can be steep. Since users must pay for the use of these indexes, the librarian cannot just sit down at the terminal and "play around" with different searches. He or she must rigorously construct a search strategy before ever going online. Inefficiency is far too costly when using these services.

Web searching is becoming similar to proprietary data retrieval services in that users are trying to filter through terabytes of data in order to find just what they want. However, because Internet search indexes are free, users tend to take a fairly cavalier attitude about using them, seldom taking the time to learn their features. This kind of searching may return useful results, but it may also return a frustrating mass of irrelevant information.

In this article we will define the components of automatic search indexes, discuss procedures for making the most effective use of them, explain some basic search features that all search indexes should (but do not) explicitly contain, and identify which indexes are the best from the point of view of those search features. These features are summarized in a table at the end of this column.

As we discussed in an earlier column (June 1996), automated search indexes aren't necessarily the most effective way to find useful information. Someone who has already sifted through that information can offer the most precise searching pointers. But search indexes are among the most popular sites on the net, indicating that users have a need to seek out information on their own. So let's try to make some sense of how users might best use these indexes.

First, it's important to make the distinction between an automated search index and a web directory. Automated search indexes consist of three components: a "robot" of some sort that automatically collects links, titles, and text from Internet sites; a database where the resource information is stored; and a search engine which allows the user to query the database for sites. Most search indexes have added a browsable subject directory of some sort, but these sites are still primarily used to search, not browse, the net. All of the indexes collect large numbers of links, and this can be both an advantage and a disadvantage. The advantage is that everything on the Web is waiting for you to find it. The disadvantage is that you have to know how to find it.
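The three components can be illustrated with a toy sketch. Everything in it is hypothetical: the PAGES dictionary stands in for the Web, and the names crawl, build_database, and search are ours. It is not meant to describe how any of the indexes discussed here is actually built.

```python
from collections import deque

# A stand-in for the Web: URL -> (title, text, outgoing links). All hypothetical.
PAGES = {
    "http://a.example/": ("Home", "cats and dogs", ["http://b.example/"]),
    "http://b.example/": ("Pets", "dogs only",
                          ["http://a.example/", "http://c.example/"]),
    "http://c.example/": ("Birds", "parrots and cats", []),
}

def crawl(start):
    """Robot: follow links from a start page, collecting titles and text."""
    seen, queue, collected = set(), deque([start]), {}
    while queue:
        url = queue.popleft()
        if url in seen or url not in PAGES:
            continue
        seen.add(url)
        title, text, links = PAGES[url]
        collected[url] = (title, text)
        queue.extend(links)
    return collected

def build_database(collected):
    """Database: map each word to the set of URLs whose title or text has it."""
    database = {}
    for url, (title, text) in collected.items():
        for word in (title + " " + text).lower().split():
            database.setdefault(word, set()).add(url)
    return database

def search(database, word):
    """Search engine: look a single word up in the database."""
    return sorted(database.get(word.lower(), set()))

database = build_database(crawl("http://a.example/"))
print(search(database, "cats"))
```

Here a search for "cats" finds the two pages that mention the word, and only because the robot reached them by following links from the starting page.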

Subject directories like Yahoo, Magellan, Galaxy, or Point, although they can be searched, are primarily categorizations of Internet resources. They are meant to be browsed through, just as you would browse the shelves of a library. Subject directories will be the focus of next month's column.

Here we will discuss basic search index features. Eight of the more popular and powerful search indexes will be compared in terms of their support of these features. Seven have been in existence for some time, and the eighth is a new product that is in beta testing (that is, it is available to the public but is not yet in final form). They are:

Alta Vista:
http://altavista.digital.com/

Advanced search help:
http://altavista.digital.com/cgi-bin/query?pg=ah&what=web

Open Text:
http://index.opentext.net/

Search help:
http://index.opentext.net/main/help.html

WebCrawler:
http://www.webcrawler.com/

Search help:
http://www.webcrawler.com/WebCrawler/Help/Help.html

Advanced searching:
http://www.webcrawler.com/WebCrawler/Help/Advanced.html

excite:
http://www.excite.com/

Search help:
http://www.excite.com/Info/search_intro.html

Advanced Searching:
http://www.excite.com/Info/advanced.html

Infoseek Guide:
http://guide.infoseek.com/

Search help:
http://guide.infoseek.com/Help?pg=HomeHelp.html&sv=IS&lk=frames

Advanced Searching and syntax guide:
http://guide.infoseek.com/Help?pg=SearchHelp.html&sv=IS&lk=frames

Lycos:
http://www.lycos.com/

Search help:
http://www.lycos.com/help/

HotBot:
http://www.hotbot.com/

Search help:
http://www.hotbot.com/help/

Infoseek Ultra (currently in beta release):
http://ultra.infoseek.com/

Search help:
http://ultra.infoseek.com/Help?pg=help.html&sv=US&lk=1

Each of these indexes has advantages and disadvantages, and while certain ones are recommended (see "The Best" in Table 1), users should find one or two that they are comfortable with, spend some time learning the searching systems, and then practice, practice, practice.

Creating a Search Strategy

First, a searcher should step away from the computer. This recommendation is a throwback to the days when searching cost money and "playing on the computer" was unheard of, but it is still valid: your search will be much more efficient if you think about what you want to search for and write it down before you start.

Proprietary search engine workbooks suggest making a worksheet that connects the concepts you want to use before you start. For example, for information on teenage alcoholism, the two concepts to examine are:

teenage AND alcoholism

However, there are more than just two terms for these concepts. Think about what they might be.

      teenage                      AND    alcoholism
OR    adolescents                  AND    alcohol abuse
OR    secondary school students    AND    alcoholic beverages
OR    youth                        AND    drinking

Then combine the queries:

(teenage OR adolescents OR secondary school students OR youth) AND (alcoholism OR alcohol abuse OR alcoholic beverages OR drinking)

This (or a variation of it) allows you to use as many terms as possible to search for your concepts. Once you have done this, return to the computer. Now you'll want to know which search indexes can handle your query most effectively.
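The worksheet step can be sketched in a few lines of Python. The function name build_query is our own invention, and the output uses the generic AND/OR notation from the worksheet; each index has its own syntax, so treat the result as a starting point to adapt rather than something every index will accept as typed.

```python
def build_query(*concepts):
    """Join the synonyms for each concept with OR, then join concepts with AND."""
    groups = ["(" + " OR ".join(terms) + ")" for terms in concepts]
    return " AND ".join(groups)

age_terms = ["teenage", "adolescents", "secondary school students", "youth"]
alcohol_terms = ["alcoholism", "alcohol abuse", "alcoholic beverages", "drinking"]

query = build_query(age_terms, alcohol_terms)
print(query)
```

Adding a third concept (for example, a list of terms for treatment) is just a matter of passing another list.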

Since automated search indexes cover so many sites, they must contain query features that allow you to retrieve exactly the information you need. If an index contains 100 items about teenage alcoholism, ideally your query should retrieve those 100 items. You would then have everything in the database that relates to your query. A query's effectiveness in this regard is its "recall." While you want a search to deliver high recall, you also want all retrieved items to be specifically about teenage alcoholism; you don't want 100 retrievals, of which only 15 are about teenage alcoholism. A query's success in this sense is its "precision." Ideally, a query should return high recall and high precision. However, it is less frustrating to achieve high precision with less recall than to receive hundreds (or thousands) of sites, many of which may be only loosely (or not at all) connected to your query.
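The arithmetic behind recall and precision is simple enough to sketch. The first case below uses the figures from the paragraph above (100 relevant items in the index, 100 items retrieved, 15 of them relevant). The second case, a narrower query returning 20 items of which 18 are relevant, is a made-up illustration.

```python
def recall(relevant_retrieved, relevant_in_index):
    """Share of the relevant items in the index that the query found."""
    return relevant_retrieved / relevant_in_index

def precision(relevant_retrieved, total_retrieved):
    """Share of the retrieved items that are actually relevant."""
    return relevant_retrieved / total_retrieved

# Broad query: 100 items retrieved, only 15 about teenage alcoholism.
broad = (recall(15, 100), precision(15, 100))

# Narrow query (hypothetical numbers): 20 items retrieved, 18 relevant.
narrow = (recall(18, 100), precision(18, 20))

print("broad  recall=%.2f precision=%.2f" % broad)
print("narrow recall=%.2f precision=%.2f" % narrow)
```

The narrow query finds only slightly more of the relevant items, but nearly everything it returns is worth reading, which is the trade-off described above.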

Here is where the syntactical tools of the trade, the search features that each index allows you to use, become crucial. There are several of these features, of which we will discuss only a few. Judging the indexes by their provision of these features is one of the most important ways to analyze which index is the best for you. Remember, whether there are 250,000, 30 million, or 50 million items behind the curtain of the search index, you need to be able to retrieve just the items of use to you.

The Basic Features

Here are some of the basic syntactical features, along with very brief explanations of how they work.

Boolean searching:

This allows terms to be put into logical groups by the use of connective terms. The basic connective terms are AND, OR, and NOT. Searching "cats OR dogs" will retrieve items containing either term, broadening the search. Searching "cats AND dogs" will retrieve only items containing both terms, narrowing the search. Searching "Mexico NOT New" will retrieve items containing Mexico but exclude every item that also contains the word New, which removes items about New Mexico and narrows the search in another way.
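One way to picture the three connectives is as operations on sets of matching documents. The sketch below assumes a tiny, made-up collection (DOCS) and a helper we have named matching; real indexes work on a far larger scale, but the logic is the same.

```python
# Hypothetical documents: id -> text.
DOCS = {
    1: "cats and dogs living together",
    2: "cats only",
    3: "dogs only",
    4: "travel in Mexico",
    5: "travel in New Mexico",
}

def matching(word):
    """Return the ids of the documents whose text contains the word."""
    word = word.lower()
    return {doc_id for doc_id, text in DOCS.items()
            if word in text.lower().split()}

cats_or_dogs = matching("cats") | matching("dogs")     # OR: union
cats_and_dogs = matching("cats") & matching("dogs")    # AND: intersection
mexico_not_new = matching("Mexico") - matching("New")  # NOT: difference

print(cats_or_dogs, cats_and_dogs, mexico_not_new)
```

OR returns documents 1, 2, and 3; AND returns only document 1; and NOT drops document 5 because it contains the word New, leaving document 4.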

Phrase searching:

This allows searching words as phrases and can be very useful in narrowing a search. If you can find sites with the words "teenage alcoholism" as a phrase, rather than just the two words mentioned anywhere in the site, you're on your way to higher precision.
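The difference between a phrase match and a plain word match can be shown with a short sketch. The function contains_phrase is hypothetical and deliberately simple: it splits on spaces and ignores punctuation handling, which a real index would have to deal with.

```python
def contains_phrase(text, phrase):
    """True if the words of the phrase appear consecutively, in order."""
    words = text.lower().split()
    target = phrase.lower().split()
    n = len(target)
    return any(words[i:i + n] == target
               for i in range(len(words) - n + 1))

page_a = "a study of teenage alcoholism rates"
page_b = "alcoholism among teenage drivers"

print(contains_phrase(page_a, "teenage alcoholism"))
print(contains_phrase(page_b, "teenage alcoholism"))
```

Both pages contain both words, but only the first contains them as a phrase, so only the first would be returned by a phrase search.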

Proximity searching:

When available, proximity operators allow you to specify how close one word must be to another, usually as a maximum number of intervening words. The closer the words teenage and alcoholism are, the more likely the item is to be pertinent to your query. The most common proximity operator is NEAR.
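A proximity test can be sketched as a comparison of word positions. The function near and its max_distance parameter are assumptions made for illustration; the window that NEAR actually uses, and whether word order matters, differs from index to index.

```python
def near(text, first, second, max_distance=10):
    """True if some occurrence of first is within max_distance words of second."""
    words = text.lower().split()
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    return any(abs(i - j) <= max_distance
               for i in first_positions
               for j in second_positions)

text = "teenage drivers and the cost of alcoholism"

print(near(text, "teenage", "alcoholism", max_distance=10))
print(near(text, "teenage", "alcoholism", max_distance=3))
```

In this sentence the two words are six positions apart, so they count as near with a window of ten words but not with a window of three.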

Truncation:

This allows you to add a wild card symbol (usually a *) at the end of a root term, in order to retrieve different variants of the term. Many search indexes do this implicitly, but you have another tool in your arsenal if you can do it explicitly. A search on "historic*" should return historic, historical, historically, etc.
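Truncation amounts to matching on the beginning of a word. The sketch below assumes a small made-up vocabulary and a helper we have called expand_truncation; it handles only a wild card at the end of the term, which is the case described above.

```python
def expand_truncation(pattern, vocabulary):
    """Return the vocabulary words matched by a pattern that may end in '*'."""
    if pattern.endswith("*"):
        root = pattern[:-1].lower()
        return sorted(w for w in vocabulary if w.lower().startswith(root))
    return sorted(w for w in vocabulary if w.lower() == pattern.lower())

vocabulary = ["historic", "historical", "historically", "history", "histogram"]

print(expand_truncation("historic*", vocabulary))
print(expand_truncation("hist*", vocabulary))
```

The first search matches historic, historical, and historically but not history. Cutting the root back to "hist*" pulls in history and also the unrelated histogram, which shows why the choice of root matters.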

Field searching:

This is probably the most important feature available for searching indexes, and it is what really separates the great ones from the good ones. A web page is a data record which can be divided into fields. Title, URL, text, summary, and heading are just a few of the fields. The more fields that can be searched, the better, because in combination, field searches increase precision dramatically. If you can search on the phrase "teenage alcohol*" in the title of web pages, and combine that with "treatment method*" in the text of the page, you can narrow the search significantly.
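The effect of combining field searches can be sketched with a few made-up page records. The records, the field names, and the helper field_contains are all hypothetical, and the matching is a plain substring test, which is cruder than what a real index would do.

```python
# Hypothetical page records, each divided into fields.
PAGES = [
    {"url": "http://one.example/",
     "title": "Teenage alcoholism: a guide",
     "text": "Covers each treatment method in use today."},
    {"url": "http://two.example/",
     "title": "School news",
     "text": "A talk on teenage alcoholism was held."},
    {"url": "http://three.example/",
     "title": "Teenage alcohol use",
     "text": "Statistics only."},
]

def field_contains(page, field, phrase):
    """True if the phrase occurs in the named field; a trailing * truncates."""
    needle = phrase.lower().rstrip("*")
    return needle in page[field].lower()

title_hits = [p["url"] for p in PAGES
              if field_contains(p, "title", "teenage alcohol*")]

combined_hits = [p["url"] for p in PAGES
                 if field_contains(p, "title", "teenage alcohol*")
                 and field_contains(p, "text", "treatment method*")]

print(title_hits)
print(combined_hits)
```

The title search alone returns two pages and skips the page that only mentions the topic in passing. Adding the text condition narrows the result to the single page that covers treatment.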

Analysis

The above features offer the user flexibility and power. Based on the number of features offered, Alta Vista, Open Text, and WebCrawler are the most powerful. (See Table 1--a simple comparison across the board of the search features explicitly available in each index--taken from the search index help pages.) These three indexes are the best not because they contain the most sites or return retrieved sites the fastest, but because they allow you to hone your search the most, increasing precision. Infoseek's Ultra is not listed as one of the best at this time, because it is currently in beta release, but a look at its capabilities shows that it has much promise to join this list.

What really sets Alta Vista, Open Text, and possibly Ultra apart are their field search capabilities. The producers of these indexes realize that it is crucially important not only to provide millions of web pages, but also to provide the end user with the tools to achieve precise retrieval.

Of course, you might not agree with these "best" picks; no one search index is right for every user. The point is to find an index that is comfortable for you and that provides you with the best results. Hopefully, Table 1 will help you do that.

There are inherent problems with all these search indexes. Because they cannot discriminate between pages that are at the same site, using them can be the equivalent of searching a card catalog for "George Washington" and retrieving a citation for every page of every book on which that name appears. Sites are often mirrored, and the index can return numerous duplicates. There is also the ever-present problem of quality. Even when you have found the indexes of choice for you, spent time learning their syntax, and sent queries that return a manageable number of items that appear to be relevant, how do you know if the sites retrieved are good sites? Information quality in the Internet environment will be discussed in a later column.

For more information on automated search indexes, along with other ways to search the Internet, see the Scout Toolkit.

http://scout.cs.wisc.edu/scout/toolkit/searching/

This article originally appeared as part of the End User's Corner, a featured column of InterNIC News, which was published monthly by Network Solutions, Inc. and InterNIC from May 1996 through March 1998. As of April 1998, End User's Corner will be published by the Internet Scout Project.

Copyright Susan Calcari and the University of Wisconsin Board of Regents, 1994-1998. Permission is granted to make and distribute verbatim copies of the End User's Corner provided the copyright notice and this paragraph is preserved on all copies. The Internet Scout Project provides information about the Internet to the US research and education community under a grant from the National Science Foundation, number NCR-9712140. The Government has certain rights in this material.

Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the author(s) and do not necessarily reflect the views of the University of Wisconsin - Madison or the National Science Foundation.

A Publication of the Internet Scout Project
