How do search engines work?
 

Search engines are the key to finding specific information on the vast expanse of the World Wide Web. Without sophisticated search engines, it would be virtually impossible to locate anything on the web without knowing a specific URL, especially as the Internet continues to grow rapidly every day. Search engines for the general web do not search the World Wide Web directly. Each one searches a database of the full text of web pages selected from the billions of web pages residing on servers. When you search the web using a search engine, you are therefore always searching a somewhat stale copy of the real web page. Only when you click a link in the search results do you retrieve the current version of the page from its server.

There are basically three types of search engines: those powered by crawlers, or spiders; those powered by human submissions; and those that combine the two.

Crawler-based engines send crawlers, or spiders, out into cyberspace. A crawler visits a web site, reads the information on the site itself, reads the site's meta tags, and follows the links that the site connects to. It returns all of that information to a central repository, where the data is indexed. The crawler periodically returns to the site to check for changes; how often it does so is determined by the administrators of the search engine.

Search engine databases are selected and built by these spiders. Although they are said to "crawl" the web in their hunt for pages to include, in truth they stay in one place. They find pages for potential inclusion by following the links in the pages already in their database. They cannot think, type a URL, or use judgment to "decide" to look something up and see what the web has to say about it. If a web page is never linked to from any other page, search engine spiders cannot find it. The only way a brand new page, one that no other page has ever linked to, can get into a search engine is for someone to send its URL to the search engine company with a request that the page be included. All search engine companies offer ways to do this.

After spiders find pages, they pass them on to another computer program for "indexing." This program identifies the text, links, and other content in each page and stores it in the search engine's database files, so that the database can be searched by keyword and by whatever more advanced approaches the engine offers. The page will then be found whenever your search matches its content.
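The link-following behaviour described above can be illustrated with a minimal sketch. It assumes a tiny in-memory "web" (the `WEB` dictionary and its `example` URLs are invented for illustration) instead of real network requests, and it is not how any particular search engine implements its spider. It only shows why a page that nothing links to stays invisible until someone submits its URL.

```python
from collections import deque

# Hypothetical miniature "web": URL -> (page text, outgoing links).
WEB = {
    "a.example/home": ("welcome to the home page", ["a.example/news", "b.example/"]),
    "a.example/news": ("news about search engines", ["a.example/home"]),
    "b.example/": ("a partner site", []),
    "c.example/new": ("a brand new page nobody links to", []),
}

def crawl(seed_urls, web):
    """Breadth-first crawl: fetch each known URL, then queue the links it contains."""
    seen = set(seed_urls)
    queue = deque(seed_urls)
    fetched = {}
    while queue:
        url = queue.popleft()
        if url not in web:  # dead link: nothing to fetch
            continue
        text, links = web[url]
        fetched[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched

# Starting from one known page, the spider only reaches pages that are linked to.
found = crawl(["a.example/home"], WEB)
print(sorted(found))

# The unlinked page is discovered only once its URL is submitted as a seed.
found_with_submission = crawl(["a.example/home", "c.example/new"], WEB)
print(sorted(found_with_submission))
```

The first crawl never reaches `c.example/new`; the second does, because its URL was supplied by hand, which mirrors the submission route mentioned above.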

Human-powered search engines rely on humans to submit information that is subsequently indexed and catalogued. Only information that is submitted is put into the index.

In either case, when you query a search engine to locate information, you are searching through the index that the search engine has created; you are not searching the Web itself. These indices are giant databases of information that is collected, stored, and subsequently searched. This explains why a search on a commercial search engine, such as Yahoo! or Google, will sometimes return results that are in fact dead links. Because the search results are based on the index, if the index has not been updated since a web page became invalid, the search engine still treats the page as an active link even though it no longer is. It will remain that way until the index is updated.
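A small sketch can make the "searching the index, not the Web" point concrete. It assumes a simple inverted index (word to set of URLs) and a simple AND search; the page contents and URLs are invented, and real engines store far more than this.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map each word to the set of URLs whose stored copy contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every query word (a simple AND search)."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical pages as they existed when the index was built.
live_web = {
    "a.example/pki": "what is pki and how do certificates work",
    "b.example/linux": "linux security basics",
}
index = build_index(live_web)  # snapshot taken at crawl time
del live_web["a.example/pki"]  # the page later disappears from the live web

hits = search(index, "pki certificates")
dead_links = {url for url in hits if url not in live_web}
print(hits, dead_links)
```

The search still returns the removed page because the index is only a snapshot; the result stays a dead link until the index is rebuilt.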

As a result, the same search on different search engines may produce different results. Not all the indices are exactly the same; their contents depend on what the spiders found or what humans submitted. Moreover, not every search engine uses the same algorithm to search through its index. The algorithm is what a search engine uses to determine how relevant the information in the index is to what the user is searching for.

One of the elements that a search engine algorithm examines is the frequency and location of keywords on a web page. Pages on which the keywords appear more often are typically considered more relevant. Search engine technology is, however, becoming more sophisticated in its attempts to discourage what is known as keyword stuffing, or spamdexing. Another common element that algorithms analyze is the way that pages link to other pages on the web. By analyzing how pages link to each other, an engine can determine both what a page is about (if the keywords of the linked pages are similar to the keywords on the original page) and whether that page is considered "important" and deserving of a boost in ranking.
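The two signals described above can be sketched roughly as follows. This is an illustrative toy, not the algorithm of any actual engine: `keyword_score` weights title matches more heavily than body matches (the weight of 3.0 is an arbitrary assumption), and `link_scores` is a simplified PageRank-style iteration in which a page becomes "important" when other pages link to it. All page names and texts are invented.

```python
import re

def keyword_score(query, title, body, title_weight=3.0):
    """Count query-word occurrences, weighting title matches more than body matches."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    title_tokens = re.findall(r"[a-z0-9]+", title.lower())
    body_tokens = re.findall(r"[a-z0-9]+", body.lower())
    score = 0.0
    for w in words:
        score += title_weight * title_tokens.count(w) + body_tokens.count(w)
    return score

def link_scores(links, damping=0.85, iterations=50):
    """Simplified PageRank-style iteration over a dict of page -> outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            # A page with no usable links spreads its score evenly over all pages.
            targets = [t for t in outgoing if t in links] or pages
            share = damping * rank[page] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank

# Hypothetical page: two title matches and two body matches for the query.
score = keyword_score("search engine", "Search Engine Basics",
                      "a search engine builds an index")
print(score)

# Hypothetical link graph: three pages link to "hub"; nothing links to "c".
rank = link_scores({"hub": ["a", "b"], "a": ["hub"], "b": ["hub"], "c": ["hub"]})
print(max(rank, key=rank.get))
```

Raw keyword counts like these are easy to inflate by repeating a word, which is one reason keyword stuffing is a concern and why link-based signals are used alongside them.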

Recommended Search Engines: Table of Features

Search engines compared: Google, Teoma, AllTheWeb Advanced, AltaVista Advanced

Size, type
  • Google: HUGE. Over 2 billion. Claims over 3 billion but about 1 billion are not fully indexed (i.e., cannot be full-text word searched).
  • Teoma: LARGE. Claims to have 1 billion fully indexed, searchable pages, and 1 billion more partially indexed.
  • AllTheWeb Advanced: HUGE. Over 3 billion fully indexed, searchable pages.
  • AltaVista Advanced: LARGE, but smaller than Google or AllTheWeb.

Noteworthy features and limitations
  • Google: Popularity ranking using PageRank. Limit of 10 words per search, excluding OR. Indexes the first 101KB of a Web page, and 120KB of PDFs.
  • Teoma: Subject-Specific Popularity ranking. Suggests terms within results to refine. Suggests pages within results with many links.
  • AllTheWeb Advanced: URL Investigator to find out about a page. Conversion of weights and measures.
  • AltaVista Advanced: Full Boolean searching and powerful searching within results using SORT BY box in Advanced Search. Basic search provides distracting commercial, paid, and directory entries.

Results ranking
  • Google: Based on page popularity measured in links to it from other pages: high rank if a lot of other pages link to it. Matching and ranking based on "cached" version of pages that may not be the most recent version.
  • Teoma: Based on Subject-Specific Popularity, links to a page by related pages.
  • AllTheWeb Advanced: Also seems to use "importance" and links to pages. In Advanced Search, SHOULD INCLUDE gives higher priority to word or phrase in box. Each box read as a phrase. In Boolean Search, rank:word is supposed to rank by that term.
  • AltaVista Advanced: By the terms you specify in Sorted by box under Boolean search box. Relevancy ranked if left blank.

Language
  • Google: Major Romanized and non-Romanized languages.
  • Teoma: Major Romanized languages.
  • AllTheWeb Advanced: Major Romanized and non-Romanized languages. Allows you to specify matching character sets.
  • AltaVista Advanced: Extensive list includes major Romanized and non-Romanized languages.

Translation
  • Google: In Translate this page link following some pages. To English from major European languages.
  • Teoma: No translation.
  • AllTheWeb Advanced: No translation.
  • AltaVista Advanced: To and from English and other languages. Click on Translate following result.

In a meta-search engine, you submit keywords in its search box, and it transmits your search simultaneously to several individual search engines and their databases of web pages. Within a few seconds, you get back results from all the search engines queried. Meta-search engines do not own a database of web pages; they send your search terms to the databases maintained by search engine companies. In most cases, the idea of meta-searching is much better than the reality. You might expect to save a lot of time by searching in only one place and sparing yourself the need to learn and use several separate search engines. Whether you do depends a lot on what the meta-search engine searches and how it organizes the results. Meta-search engines cannot be better than the databases they query. There are two families of smarter meta-search engines:

  • Meta-searchers that search good databases, accept complex searches, integrate results well, eliminate duplicates, and offer additional features such as clustering by subjects within your search results.
  • Tools for serious digging in many resources, with powerful abilities to help you find what you seek within search results. These are appropriate for very serious researchers to use for in-depth probing of a topic.
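The "integrate results well, eliminate duplicates" step of a meta-searcher can be sketched as below. This is only an illustration under stated assumptions: the engine names and result lists are hypothetical, the querying of real engines is left out entirely, the URL normalization is deliberately crude, and scoring by the sum of 1/rank is one simple choice among many, not the method of any named product.

```python
from collections import defaultdict

def normalize(url):
    """Crude URL normalization so the same page from two engines compares equal."""
    url = url.lower()
    for prefix in ("http://", "https://"):
        if url.startswith(prefix):
            url = url[len(prefix):]
    if url.startswith("www."):
        url = url[4:]
    return url.rstrip("/")

def merge_results(results_by_engine):
    """Merge ranked lists; score each URL by the sum of 1/rank across engines."""
    scores = defaultdict(float)
    sources = defaultdict(list)
    for engine, urls in results_by_engine.items():
        seen_here = set()
        for position, url in enumerate(urls, start=1):
            key = normalize(url)
            if key in seen_here:  # skip duplicates within one engine's list
                continue
            seen_here.add(key)
            scores[key] += 1.0 / position
            sources[key].append(engine)
    # Highest combined score first; ties are broken alphabetically by URL.
    ranked = sorted(scores, key=lambda k: (-scores[k], k))
    return [(key, sorted(sources[key])) for key in ranked]

# Hypothetical ranked results, as if returned by two separate engines.
results = {
    "engine_one": ["http://www.example.org/", "http://a.example/x"],
    "engine_two": ["https://example.org", "http://b.example/y"],
}
merged = merge_results(results)
for url, engines in merged:
    print(url, engines)
```

The page both engines returned appears once, at the top, with both sources listed. The merged list can still only contain what the underlying engines found, which is the limitation noted above.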

Meta-Searchers

Vivisimo
  • What's searched: Currently searches Netscape (Google), Lycos (FAST, similar to AllTheWeb), MSN Search (Inktomi), lii.org, and others. Can customize in Advanced Search form.
  • Complex search ability: Accepts and "translates" complex searches with Boolean operators and field limiting.
  • Results display: Results accompanied with subject subdivisions based on words in search results, giving usually the major themes (Vivisimo Clustering Engine). Click on these to search within results on each theme. Can save and search within the titles, URLs, and descriptions at the bottom.

Metacrawler & Dogpile
  • What's searched: Searches Google, Yahoo, AltaVista, Ask Jeeves, About, LookSmart, Overture, FindWhat.
  • Complex search ability: Accepts Boolean logic, especially in advanced search modes.
  • Results display: Employs Vivisimo clustering technology to provide subject clusters within each search result. Dogpile allows you to see each search engine's results separately, not consolidated in one list.

Meta-Search Engines for Serious Deep Digging

SurfWax
  • What's searched: Click My Search Sets, and select from a good list of search engines, including: AllTheWeb, AltaVista, AOL, Excite, Google, Hotbot, MSN, NBCi, OpenDirectory, Yahoo!. Can mix with educational, US Govt tools, and news sources, or many other categories. At the Free level, 3 search sets or 10 resources from a pool of 500 resources.
  • Complex search ability: Accepts " ", +/- . Default is AND between words. Simple searches are recommended, allowing SurfWax's SiteSnaps and other features to help you dig deeply into results.
  • Results display: Can customize in My Preferences after you join at the free or higher level. Can save searches in an InfoCubby. Results can be sorted by relevancy, A-Z by site title, or source. FocusWords from a page represent its context. Shows your words in context in the page. "ContextZooming" - more context of your search terms as found in page. Gives statistics on images and links in most pages.

Copernic Agent
  • What's searched: Select Google and others from great list of search engines by clicking the Properties button following Advanced Search search box. Some good choices are: AltaVista, AOL, EuroSeek, Fast/AllTheWeb, Google, Hotbot, Lycos, MSN, Netscape Netcenter, Open Directory Project, Teoma, Wisenut, Yahoo!
  • Complex search ability: ALL, ANY, Phrase, and more. Also Boolean searching within results under Refine. Extensive help under Help menu.
  • Results display: Integrated with Internet Explorer. Must be downloaded and installed, but Basic version is free of charge. Many advanced features, can change results display, tracks previous searches.

    Cihan YILDIRIM-YÜCEL
