What's next? (Search Engines and Web Indexes)
What is a search engine? | What's a 'bot? | What's the index for? |
How does a search engine list web sites? | What's the best search engine? |
What do I use? | What are ways to make my search more effective? |
What are the most popular search utilities? | Specialized Search Engines |
For more information | Assignments
In this lesson we will discuss how search
engines work in general terms, not all possible scenarios (or search algorithms!).
What is a search engine really and how does it work?
What we think of as a search engine is really a team effort. There are
three "members" of the team: a mechanism that identifies web pages to be
included in the database, a mechanism that indexes those pages, and a
searching mechanism with an interface that scans the index for keywords.
Users search the index (and hence, the database or web documents) through
a query box or a template. Documents in which the search terms occur are
presented as "hits."
Although some facilities incorporate "natural language" searching
(searching by asking a question such as "Where are the doughnuts?"),
most search tools retrieve "hits" or "matches" by seeking occurrences of
your search terms within their databases and by attempting to match the
terms (converted to a "string" of data bits) against their indexes.
Because the terms are converted to a digital string, the search engine
must somehow be instructed to include plurals and alternate forms of a term.
Note: although some search tools automatically include plurals, many do
not. If you are interested in "dogs," search for "dog or dogs" or use a
wildcard such as * (a wildcard is a typed symbol which simply means "put
any character here"). Some search engines allow "stemming." This involves
using a special character symbol which simply means "put any ending here
after this point." An example: the term comput& (where & = stemming
symbol) would bring up hits for the following words: computer, computers,
computing, computation, etc.
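The stemming idea in the note can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine does it: the function names `stem_pattern` and `matching_words` and the sample vocabulary are made up for this example, and the & symbol is treated as "any ending" exactly as described above.

```python
import re

def stem_pattern(term, stem_symbol="&"):
    """Turn a term such as 'comput&' into a pattern matching any ending.

    A term without the stemming symbol only matches itself exactly.
    """
    if term.endswith(stem_symbol):
        prefix = re.escape(term[:-len(stem_symbol)])
        return re.compile(rf"^{prefix}\w*$", re.IGNORECASE)
    return re.compile(rf"^{re.escape(term)}$", re.IGNORECASE)

def matching_words(term, words):
    """Return the words that the (possibly stemmed) term would retrieve."""
    pattern = stem_pattern(term)
    return [word for word in words if pattern.match(word)]

# Hypothetical vocabulary standing in for the words in an index.
vocabulary = ["computer", "computers", "computing", "computation",
              "compete", "dog", "dogs"]

print(matching_words("comput&", vocabulary))
print(matching_words("dog", vocabulary))
```

Searching for plain "dog" retrieves only "dog" in this sketch, which is why the note suggests "dog or dogs" when the tool does not add plurals for you.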
What's a 'bot?
A 'bot, otherwise known as an intelligent agent, spider, crawler, robot,
or worm, is an automated software program which may be programmed to
search for terms (data "strings") matching certain criteria. In terms of
web search engines, a 'bot identifies and notes the URLs of web pages to
be included in the database. Later, another 'bot comes along and works on
the interiors of the web documents, recording occurrences of words and
their position within the text. This information is used to create a huge
index. 'Bots travel along the links of a web site; that is, they crawl or
traverse from one hypertext link to another.
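To make the "crawl from link to link" idea concrete, here is a minimal sketch of a 'bot's traversal. It makes some loud assumptions: the "web" is a small made-up dictionary (`TOY_WEB`) rather than real pages fetched over the network, and the breadth-first order is just one reasonable choice, not a claim about how real 'bots schedule their visits.

```python
from collections import deque

# Hypothetical miniature "web": each URL maps to the links found on that page.
TOY_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}

def crawl(start_url, web):
    """Return URLs in the order a breadth-first crawler would note them."""
    seen = {start_url}          # URLs already noted, so no page is visited twice
    queue = deque([start_url])  # URLs waiting to be visited
    discovered = []
    while queue:
        url = queue.popleft()
        discovered.append(url)
        for link in web.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return discovered

for url in crawl("http://example.com/", TOY_WEB):
    print(url)
```

The `seen` set matters: pages link back to each other, and without it the 'bot would go around in circles forever.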
What's the index for?
The index is how the search engine locates the URLs which match your
request. The web documents containing the query keywords are presented as
a listing which may include a brief summary of the site. A simple way to
understand the index is to think of it as a computerized book index. To
discover where a topic occurs in a book, we would look up the word in the
index, which would indicate the page number(s) where the term occurs. Now
imagine that every single word is included in the book index. A
computerized version might be represented like this:
| Keyword | Number of times keyword occurs in book | Position(s) in book of keyword | Page number(s) |
| Apple | 175 | title page; page 1: first paragraph, word #5; page 2: first paragraph, word #20, second paragraph, word #15; page 5: 2nd paragraph, word #21; etc.; in summary | title page, table of contents, pages 1, 2, 5, 12, 25, etc. |
| Orange | 22 | table of contents; page 3: first paragraph, word #3; page 17: first paragraph, word #30; page 21; etc. | table of contents, pages 3, 17, 21, etc. |
| Grape | 3 | page 50: 2nd paragraph, word #18; page 52: 1st paragraph, word #41; page 53: 1st paragraph, word #4 | pages 50, 52, 53 |
Some immediate observations might include:
- a) the word apple occurs a lot in the database
- b) the word apple occurs in the title
- c) the words apple and orange occur in the table of contents
- d) the word grape does not occur in the title or table of contents.
A search engine uses its index to retrieve web documents in which your
search terms occur. The index lists the term and where it occurs (the URL
or address of the web page) much like a book index.
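The book-index comparison maps quite directly onto a small data structure, often called an inverted index. The sketch below is an illustration under stated assumptions: the two pages and their URLs are invented, and `build_index` is a hypothetical helper that records, for every word, the URLs where it occurs and the word positions within each page.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each word to {url: [word positions]} for a dict of url -> text."""
    index = defaultdict(dict)
    for url, text in pages.items():
        words = re.findall(r"[a-z0-9']+", text.lower())
        for position, word in enumerate(words, start=1):
            index[word].setdefault(url, []).append(position)
    return index

# Hypothetical pages standing in for web documents a 'bot has collected.
pages = {
    "http://example.com/fruit": "Apple pie and apple cider",
    "http://example.com/citrus": "Orange juice and apple juice",
}

index = build_index(pages)
print(index["apple"])            # where "apple" occurs, and at which word positions
print(index.get("grape", {}))    # a word no indexed page contains: no hits
```

Looking up a word is then just a dictionary lookup; a term that was never indexed simply returns nothing, however well the query is phrased.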
Remember: a search engine returns hits only from its own database, that
is, web pages which it has indexed. So if the site you are looking for has
not yet been indexed, it won't be in the results listing no matter how
magnificent your search strategy or statement.
How does a search engine decide how to list web sites matching my search terms?
Each search engine uses a different algorithm or method to calculate a
"relevance" score for each hit, which it then uses to rank the results.
Have you ever noticed the numbers which sometimes appear next to the URLs
in a listing of search results? This is the "relevance ranking." Relevance
means the probability that the "hit" or "match" is on-target with your
query. The creators of search engines change the way they calculate
relevance and do not tell us mere users their methodology; being high in
the major search engines' rankings on a topic means big business.
Sometimes Web site owners try to skew the odds of appearing on the first
page of "results" for folks searching specific keywords. Being on the
first page or in the top results increases the likelihood that the site
will be seen and hence selected by the user. Unscrupulous folks "spam" the
search engine to try to improve their rankings (and hence, their Web-based
business) by a variety of methods, including using "invisible text" (where
text is colored the same as the background) or repeatedly using keywords
in "meta-tags" (descriptive information not usually seen by the user
except when viewing the "page source" -- see below).
Perhaps most unsettling is the rising trend of some search engines which,
in effect, sell higher ratings to companies willing to pay for the
privilege. Most users will be unaware that the set of search results has
in effect been manipulated to boost these companies' ratings artificially.
Exactly how relevance is calculated is protected, proprietary information,
but it is important to be aware that search engine providers may have
alliances or agreements with other businesses (reciprocal and/or
financial) which may affect search results.
In general, however, relevance is calculated by noting where the term
occurs within the text and assigning this position a "weight" or level of
importance. Some search utilities also include a popularity element in
the relevance calculation; that is, the more a site is linked to or used,
the higher the rating. Search terms occurring in the title, summary, in
key positions within a paragraph or appearing several times within a
paragraph usually carry more "weight" because there is a higher
probability that terms in these positions indicate significant material
on the topic.
This is very similar to our book index example above; because the term
apple occurs many times and in key positions (title, table of contents,
beginning of paragraphs) there is a high probability that the document
contains significant information about apple. Note that orange also
occurs in the table of contents, an indication of the term's relative
importance (it is a significant topic, but not as important as apple).
The algorithm of the search engine and the methodology it uses to
calculate relevance emulate the observations and judgments we make based
on our experience. A search engine will return the book in our index
example as a hit when the search terms apple and grape are requested,
whereas a human might judge that although the two terms occur within the
document, there is no significant relationship between them and the
document is hence irrelevant.
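The weighting idea can be sketched with the book example. Everything specific here is an assumption made for illustration: the weights in `FIELD_WEIGHTS`, the three field names, and the tiny sample "book" are invented, since real search engines keep their actual formulas secret.

```python
# Assumed weights, invented for illustration; not taken from any real engine.
FIELD_WEIGHTS = {"title": 5.0, "contents": 3.0, "body": 1.0}

def relevance(term, document):
    """Add up weight * number of occurrences of the term in each field."""
    term = term.lower()
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = document.get(field, "").lower().split()
        score += weight * words.count(term)
    return score

# Hypothetical miniature version of the book from the index example.
book = {
    "title": "apple growing",
    "contents": "apple orange",
    "body": "apple apple orange grape",
}

for term in ("apple", "orange", "grape"):
    print(term, relevance(term, book))
```

With these made-up weights, apple scores highest because it appears in the title and the table of contents as well as the text, orange lands in the middle, and grape, which only appears in the body, scores lowest. That mirrors the observations we made by eye.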
Some search engines look only in certain fields when indexing documents,
such as the title field, the first paragraph and something called
"meta-tags." Meta-tags allow the creator of a web site to add descriptive
keywords which are not displayed in the actual web documents; they are
there specifically to enhance retrieval of the document. As people "spam"
the search engine (for example, by repeating terms over and over again)
meta-tags are decreasing in importance, because the folks that program the
'bots train them to overlook repetitions and other clues to "spamming."
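Here is a rough sketch of how a 'bot might notice repeated keywords in a meta-tag. It uses Python's standard `html.parser` module; the class name `MetaKeywordParser`, the helper `repeated_keywords`, the threshold of two repeats and the sample page are all assumptions made up for this example, not a description of any real search engine's spam filter.

```python
from collections import Counter
from html.parser import HTMLParser

class MetaKeywordParser(HTMLParser):
    """Collect the comma-separated terms from <meta name="keywords"> tags."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "keywords":
            content = attributes.get("content") or ""
            self.keywords.extend(
                k.strip().lower() for k in content.split(",") if k.strip()
            )

def repeated_keywords(html, max_repeats=2):
    """Return keywords that appear more often than max_repeats (assumed limit)."""
    parser = MetaKeywordParser()
    parser.feed(html)
    counts = Counter(parser.keywords)
    return {word: n for word, n in counts.items() if n > max_repeats}

# Hypothetical page whose author has stuffed the keywords meta-tag.
page = ('<html><head><meta name="keywords" '
        'content="socks, red socks, socks, socks, socks"></head></html>')

print(repeated_keywords(page))
```

A 'bot that spots a keyword repeated this way could ignore the extra copies or discount the meta-tag altogether.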
Note: because each search engine assigns relevancy rankings differently,
if you execute exactly the same search in several search engines you will
have different results in terms of how and where the URLs are listed (even
if the database contents are identical).
What's the best search engine?
I'm sure I'm going to disappoint a lot of folks by giving the answer "the
best search engine is the one that fits the task" instead of recommending
a particular utility. Until you have some experience with knowledge
seeking tools and, importantly, with identifying your real information
need, it may be difficult to ascertain which tool is best for your
purpose. For example, a query on "Leonardo da Vinci's Mona Lisa" is likely
to be more successful than "that lady with the smile by a Renaissance
artist" (or simply "da Vinci"), just as "dosage and usage guidelines for
St. John's Wort" is likely to be more successful than "St. John's Wort."
But the good news is, you will make better choices with experience.
What do I use? Well, that depends...
Remember I am a librarian in an academic (college) library, so I never
know what the next information request will be (that's the fun part!). But
this means in practical terms that I am looking for information in a
variety of places, which precludes having a standard game plan. Here are a
few of my search tactics/favorite tools:
- for general use, I use Altavista (http://www.altavista.com). It's fast,
  returns good hits and is accurate. Plus its database is huge (it
  alternates with Hotbot as the largest web database). Altavista also has
  a nice refine feature for weeding out irrelevant hits.
- for quick queries where I want precision (accuracy) in results, I'll use
  Google.com (http://www.google.com). It's fast and uncannily accurate.
  The "cached" page feature is useful when the actual Web site is busy and
  unreachable.
- increasingly I find myself using "Ask Jeeves" (http://ask.com) because
  the knowledge base gives me some good ideas. If I need to find out the
  "literature" (i.e. what's available on the Web) of a subject or find a
  quick definition, I'll go here first.
- for searching by domain (.edu, .com, .gov) I use Hotbot
  (http://www.hotbot.com). I also use Hotbot for field searching since it
  has a nice template in its "Advanced Search."
- for specific subjects (rather than a specific query), I might use a
  specialized directory or search engine, particularly in the Arts,
  Education and Health.
- I tend to stay away from meta-search engines (which search multiple
  search engines at once) because they strip away my Boolean or field
  commands. I would, however, recommend them for general searches where
  advanced searching techniques will not be used. I also use them for
  comparing results of a search across several search engines.
- if I want to group my hits by related topics or do a more "academic"
  type of research, I use NorthernLight (http://www.northernlight.com).
- if I want to use concept searching (find a good web site and then look
  for others using the same criteria), I use Excite
  (http://www.excite.com) or Google (http://www.google.com).
- frequently I will change tactics in mid search -- if I get too many
  hits, I'll weed a few out. If I do not find anything relevant, I'll
  switch to a different source and/or modify my "search statement" or
  keywords.
More search tips in Lesson 4!
What are simple ways to make my search more effective?
A very effective way to increase the relevance or precision of "hits" is
to search as a phrase. In most cases this simply means putting quotation
marks around the search terms. "Red socks" is a different search than red
socks in most search engines. What you are actually doing by searching as
a phrase is using the concept of proximity, which concerns the terms'
physical closeness to one another. A document with red and socks occurring
close or next to each other is more likely to be on target than a document
with red in the title and socks buried in the text.
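The difference between searching for two separate words and searching for a phrase can be sketched like this. The helper names `contains_all` and `contains_phrase` and the two sample documents are invented for the example, and real search engines handle punctuation and word forms far more carefully than a simple split on spaces.

```python
def tokenize(text):
    """Split text into lowercase words (a deliberately crude tokenizer)."""
    return text.lower().split()

def contains_all(text, terms):
    """True if every term occurs somewhere in the text, in any position."""
    words = tokenize(text)
    return all(term.lower() in words for term in terms)

def contains_phrase(text, phrase):
    """True only if the words of the phrase occur next to each other, in order."""
    words = tokenize(text)
    target = tokenize(phrase)
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))

# Hypothetical documents for comparison.
doc_a = "she wore red socks to the game"
doc_b = "red barn for sale with old wool socks in the attic"

for doc in (doc_a, doc_b):
    print(contains_all(doc, ["red", "socks"]), contains_phrase(doc, "red socks"))
```

Both documents match the plain search for red and socks, but only the first matches the phrase search, which is the extra precision that the quotation marks buy you.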
Another way to increase your search effectiveness is to be as specific as
possible; that is, include as many terms and synonyms as you can think of
to fully describe your topic. Instead of
women and computers
try
(woman or women) and (technology or computer)
and (training or professional development) and (barriers or problems)
Note: search utilities may not support the use of parentheses (called
nesting) in basic searches, although many support them in their
"advanced" searches.
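The nested query above has a simple shape: several groups joined by "and," with the alternatives inside each group joined by "or." A minimal sketch of evaluating such a query follows. The function name `matches`, the way the query is written as a list of groups, and the two sample titles are assumptions for illustration; note also that this sketch matches on substrings, so "computer" will also match "computers."

```python
def matches(text, and_of_or_groups):
    """True if, for every group, at least one of its terms occurs in the text.

    Matching is by substring on lowercased text, a simplification
    chosen for this sketch.
    """
    text = text.lower()
    return all(
        any(term in text for term in group)
        for group in and_of_or_groups
    )

# The query from the lesson, one tuple per parenthesized group.
query = [
    ("woman", "women"),
    ("technology", "computer"),
    ("training", "professional development"),
    ("barriers", "problems"),
]

# Hypothetical document titles.
doc1 = "Barriers facing women in computer training programs"
doc2 = "Women and computers: a short history"

print(matches(doc1, query))
print(matches(doc2, query))
```

The second title satisfies the simple search women and computers but fails the more specific query, because it says nothing about training or barriers.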
So to recap, phrase searching and specificity are two simple ways to
increase precision in searching.
What are the most popular and useful search utilities? (the "major" search engines)
Ok folks. We are looking at a sampling of search engines and describing
generalities; we are not attempting to create a definitive listing. For
example, we'll be discussing meta search engines in Lesson 6, so you won't
find them listed here.
- Alta Vista (http://av.com)
  Originally developed by Digital Equipment Corporation, Alta Vista
  searches the Web and Usenet. Its very large database supports both
  simple and advanced searching, with the ability to limit searches to
  select portions of web documents. For example, it is possible to limit
  searches to title, domains, images and links within Web documents and
  by particular newsgroups or subjects in Usenet. It also offers the
  ability to browse by subject (although this is rather slow).
- Excite (http://www.excite.com)
  Search site featuring a very large database and a lot of "extras" such
  as: Excite Channels (guide to sites by subject), stock quotes, news, tv
  and searching of Newsgroups. Offers concept searching.
- HotBot (http://www.hotbot.com)
  Voted no. 1 among search engines by PC Magazine, HotBot offers a
  sophisticated interface with a vast array of options such as: searching
  by dates, by certain domains in the U.S. (e.g. .com, .org, .edu, .gov),
  by media type (e.g. image, audio, video). Also offers a huge database,
  powerful advanced searching options, access to other search tools by
  type and a subject guide.
There are more "major search engines" for you to evaluate in Assignments.
Specialized Search Engines and Collections:
Specialized search engines are most often programmed to "collect" web
documents along a topical theme, for example in the Arts, Science,
Health-related topics or even more specialized subjects such as Ancient
History of the Mediterranean.
Also fitting in this category are "search tools" that really calculate
rather than retrieve information (such as those fitting in the "distance
between two points" or "salary differential" categories). Since it is
impossible to list specific tools here, the following are sites which
group or list subject specific search engines or tools:
- Beaucoup (http://www.beaucoup.com)
  Beaucoup is a collection of approximately 1000 search engines,
  directories and indices from all over the world, organized into
  categories such as: General Searchers, Reviewed Sites/What's New,
  Software, Reference, Education, Art/Graphics,
  Social/Environmental/Political Concerns, and Consumer Medicine. Good
  starting point for popular subjects.
- Internet Search Engine Collections
  (http://library.albany.edu/internet/engines.html#collections)
  from the University at Albany by Laura Cohen.
For more information visit these excellent sites: