HtmlSearch Documentation: Overview

Why use HtmlSearch

HtmlSearch lets you search a web site that is either local (e.g. on your hard disk, a CD-ROM, or a LAN) or remote (e.g. on the World Wide Web).
Many web sites have a search facility but it generally involves the execution of a CGI script that is located on the site's server.
If a site is located on a server that does not support CGI scripts, or on local media, you cannot search it that way. HtmlSearch allows you to search without the assistance of CGI scripts. It is therefore useful for:

Sites hosted on a server that does not support CGI scripts

Sites that do not offer a search facility

Sites local to a hard disk or CD-ROM.

Verifying that all the links on a site are reachable

Main features

works on any platform, operating system, or Java-enabled browser

search a site without the use of server side indexing or CGI

start the search from any page or local file

AND and OR boolean operators

search for individual words or phrases

allow the search to explore hosts other than the starting host

ability to search HTML files or any other type of file (text file, word processing document, etc.)

ability to search through HTML tags

refine searches by searching through the previous search results

ability to eliminate hosts, file types, paths, large files, directory hierarchies from search

shows broken links, inaccessible pages, unavailable hosts

ability to save search results as a text list or an HTML index

ability to index an entire site - use custom dictionary for word selection

configure which functionality and screens you want your users to be able to see and access
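
The AND and OR operators listed above can be illustrated with a minimal sketch. The class and method names (TermMatcher, matches) are hypothetical and are not part of HtmlSearch, and the case-insensitive comparison is an assumption made for the example only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch of AND/OR matching; not HtmlSearch's actual code. */
final class TermMatcher {
    enum Operator { AND, OR }

    /**
     * Returns true if the text contains every term (AND) or at least one
     * term (OR). A term may be a single word or a whole phrase.
     * Assumption: matching ignores case.
     */
    static boolean matches(String text, List<String> terms, Operator op) {
        String haystack = text.toLowerCase(Locale.ROOT);
        boolean any = false;
        boolean all = !terms.isEmpty();
        for (String term : terms) {
            boolean found = haystack.contains(term.toLowerCase(Locale.ROOT));
            any |= found;
            all &= found;
        }
        return op == Operator.AND ? all : any;
    }

    public static void main(String[] args) {
        String page = "HtmlSearch follows links and reads each page";
        List<String> terms = Arrays.asList("follows links", "cgi");
        System.out.println(matches(page, terms, Operator.AND)); // false
        System.out.println(matches(page, terms, Operator.OR));  // true
    }
}
```

Here the phrase "follows links" is found but "cgi" is not, so the page matches with OR and does not match with AND.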

New in Release 2.1

corrected bug by which search would pause for a long time on certain pages

corrected a bug which forced the applet to start as 'lite' after using the help, even if the pro version was loaded

better handling of malformed HTML syntax (empty links, repeating HREFs, etc.)

many changes to the documentation

better memory management (allocation, cleanup) allows for longer searches

print the search elapsed time at the end of the search

print the applet info in the java console

smaller window size, leaving more room for the pages found display

colored start/stop button (not in Netscape 3)

reworded the domain selection options (formerly: new search/found pages/previous found pages)

add 'downLevels', identical to 'dirLevels' but for downstream directories

'downLevels' and 'dirLevels' restrict search in directories in the same path as the starting URL

better display of the found string so that it ends on a space

easier to stop search and stop index generation

using HtmlSearch as an application: CTRL-F4 kills the application

HtmlSearch is delivered as a JAR archive for faster loading

index:

always uses the found pages

ability to generate a CSV file

changes to the GUI, button placement, and labels

choice to separate the words if several words were specified in the 'look for'

corrected bug that prevented selecting the file to save to (this is still subject to Applets security restrictions, which are particular to the browser and your settings)

the dictionary filename can be specified relative to the codeBase (the directory parent to HtmlSearchApp), or as an absolute URL.
If the file does not exist, the default dictionary (HtmlSearchApp/indexExclusion.txt) is used.
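
The dictionary lookup rule above (a name relative to the codeBase or an absolute URL, with a fallback to the default dictionary) can be sketched as follows. This is only an illustration: DictionaryLocator, locate, and the exists check are hypothetical names, and the example.com URLs are placeholders. The sketch relies on the standard java.net.URI.resolve method, which leaves absolute URLs unchanged and resolves relative names against the base.

```java
import java.net.URI;
import java.util.function.Predicate;

/** Hypothetical sketch of the dictionary lookup rule; not HtmlSearch's actual code. */
final class DictionaryLocator {
    // Default dictionary, relative to the codeBase, as stated in the text above.
    static final String DEFAULT_DICTIONARY = "HtmlSearchApp/indexExclusion.txt";

    /**
     * Resolves the dictionary name against the codeBase and falls back to
     * the default dictionary when the resolved file does not exist.
     * The exists predicate stands in for a real file or network check.
     */
    static URI locate(URI codeBase, String name, Predicate<URI> exists) {
        URI candidate = codeBase.resolve(name);
        return exists.test(candidate) ? candidate : codeBase.resolve(DEFAULT_DICTIONARY);
    }

    public static void main(String[] args) {
        // Assumption: the codeBase ends with '/' so relative names are appended to it.
        URI codeBase = URI.create("http://www.example.com/site/");
        Predicate<URI> exists = u -> !u.getPath().endsWith("missing.txt");
        System.out.println(locate(codeBase, "words/custom.txt", exists));
        System.out.println(locate(codeBase, "http://other.example.com/dict.txt", exists));
        System.out.println(locate(codeBase, "missing.txt", exists));
    }
}
```

The first call yields a URL under the codeBase, the second keeps the absolute URL as given, and the third falls back to the default dictionary.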

Usage

There are two versions of HtmlSearch:

HtmlSearch Lite is a simple search engine with basic capabilities. Why bother with it, you may ask? It's free! You can download it and pass it along. It's a good idea to register your copy so that we can notify you of bug fixes. Registration also lets us know how and where HtmlSearch is used, so we can improve it.
HtmlSearch Pro gives you all the features described above. It is shareware, so please register your copy, and you'll sleep better knowing that you've done the right thing.

HtmlSearch can be used in one of 3 modes:

As part of a page in a web site: the site's designer has configured HtmlSearch to work on this web site. Depending on the choices the designer made when incorporating HtmlSearch, you may or may not be able to search outside that site, to access all the panels described in this help, etc.
As part of a web page not included in a web site: this would be the case, for example, if you downloaded the HtmlSearch program so that you can use it on any web site or files.
As an application executed outside of a web page. This offers the advantage of unrestricted access to local or remote files and pages, but since it is not executed within a browser, you will not be able to view the pages that match your search.

There is a how-to file that explains how to set up HtmlSearch.

How it works

HtmlSearch functions very much like your browser does when you click on a link, except that it does so automatically. In other words, it reads a page, looks at the links it contains, follows those links, and so on.
As such, it has the same access capabilities and limitations as your browser, in addition to restrictions due to the particular nature of Java applets.
In addition it keeps the pages it examines in its own cache, so that from one search to the next it doesn't re-read pages that have already been read. (This applies only as long as you do not exit the current search session.)
The links followed are those found in hyperlinks (e.g. "A HREF", image maps, etc.). HtmlSearch does not follow links to, or generated by, CGI scripts, Java, or JavaScript calls.
If a link points to a directory or to an unreachable URL, HtmlSearch will try to access the files 'index.html' and 'index.htm' in that directory.
Searches can be slow if the pages searched are located on slow or busy servers, or if your modem is slow. The speed of the search also depends on your PC, and the settings in the Advanced panel.
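
The behavior described above (read a page, collect its links, follow them, and fall back to 'index.html' or 'index.htm' for directories) can be sketched as below. This is a simplified illustration under stated assumptions, not the actual implementation: LinkCrawler and its members are hypothetical names, the site is an in-memory map of URL to HTML instead of real network access, only double-quoted HREF attributes are recognized, and the exclusion of CGI, Java, and JavaScript links is not modeled.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of a link-following search; not HtmlSearch's actual code. */
final class LinkCrawler {
    // Simplification: only double-quoted HREF attributes are recognized.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    final List<String> matches = new ArrayList<>(); // pages containing the term
    final List<String> broken = new ArrayList<>();  // links that could not be read

    /** Searches an in-memory site (URL to HTML) for term, starting at startUrl. */
    void crawl(Map<String, String> site, String startUrl, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        Deque<String> queue = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        queue.add(startUrl);
        seen.add(startUrl);
        while (!queue.isEmpty()) {
            String url = queue.remove();
            String page = resolvePage(site, url);
            if (page == null) {
                broken.add(url);
                continue;
            }
            if (!page.equals(url) && !seen.add(page)) {
                continue; // the index file was already reached through another link
            }
            String html = site.get(page);
            if (html.toLowerCase(Locale.ROOT).contains(needle)) {
                matches.add(page);
            }
            Matcher m = HREF.matcher(html);
            while (m.find()) {
                String link = m.group(1).trim();
                if (link.isEmpty()) {
                    continue; // skip empty links
                }
                // Resolve relative links against the page that contains them.
                String target = URI.create(page).resolve(link).toString();
                if (seen.add(target)) {
                    queue.add(target);
                }
            }
        }
    }

    /** Returns the URL itself if readable, else its index.html or index.htm, else null. */
    static String resolvePage(Map<String, String> site, String url) {
        if (site.containsKey(url)) {
            return url;
        }
        String dir = url.endsWith("/") ? url : url + "/";
        for (String name : new String[] {"index.html", "index.htm"}) {
            if (site.containsKey(dir + name)) {
                return dir + name;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, String> site = new HashMap<>();
        site.put("http://example.com/index.html",
                "<a href=\"docs/\">Docs</a> <a href=\"missing.html\">Gone</a>");
        site.put("http://example.com/docs/index.htm",
                "<A HREF=\"../index.html\">Home</A> applet search tips");
        LinkCrawler crawler = new LinkCrawler();
        crawler.crawl(site, "http://example.com/index.html", "applet");
        System.out.println("matches: " + crawler.matches);
        System.out.println("broken:  " + crawler.broken);
    }
}
```

In this example the link to the 'docs/' directory is satisfied by 'index.htm', the link back to the start page is not read twice, and 'missing.html' is reported as a broken link.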

Starting and Stopping the Search: as stated above, HtmlSearch keeps all the visited pages in its cache. This means that if you stop a search and then restart it, even from a different URL or with a different search string, HtmlSearch will first use its cache before going back to the network to read a page. This minimizes search time and network load, and allows for very fast searches the "second time", e.g. when searching for different strings on the same set of pages. The cache is window-specific: there is one cache per window in which HtmlSearch is loaded, and even if several HtmlSearch windows are open, they do not share their caches. All searches also use the browser's caching mechanism, which may further reduce access time and network load.
The caching mechanism may have a negative impact if you do many searches in the same window, since all the pages visited are kept in the cache: if you search through thousands of pages, the memory requirements may exceed the browser's capacity (this depends on the platform, the browser, and the browser settings). It may therefore be wise to 'kill' the search every now and then and restart in a new window.
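
The trade-off described above can be illustrated with a minimal sketch of a per-window page cache. The names (PageCache, fetcher, fetchCount) are hypothetical and the fetcher stands in for real page access; this is not HtmlSearch's actual code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of a per-window page cache; not HtmlSearch's actual code. */
final class PageCache {
    private final Map<String, String> pages = new HashMap<>();
    private final Function<String, String> fetcher; // stands in for reading a page
    private int fetchCount = 0;

    PageCache(Function<String, String> fetcher) {
        this.fetcher = fetcher;
    }

    /** Returns the cached page if present; otherwise reads it once and keeps it. */
    String get(String url) {
        String cached = pages.get(url);
        if (cached != null) {
            return cached;
        }
        fetchCount++;
        String content = fetcher.apply(url);
        pages.put(url, content);
        return content;
    }

    int fetchCount() { return fetchCount; }

    /** Number of pages held in memory; it grows with every new page visited. */
    int size() { return pages.size(); }

    /** Comparable to restarting in a new window: all cached pages are dropped. */
    void clear() { pages.clear(); }

    public static void main(String[] args) {
        PageCache cache = new PageCache(url -> "content of " + url);
        cache.get("http://example.com/a.html"); // first search reads the page
        cache.get("http://example.com/a.html"); // second search is served from the cache
        System.out.println(cache.fetchCount()); // 1
        System.out.println(cache.size());       // 1
    }
}
```

Each instance holds its own map, which mirrors the statement that windows do not share their cache, and the map only shrinks when it is cleared, which is why very long sessions can run out of memory.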

Requirements

HtmlSearch is a Java program, and you need a Java-enabled browser (e.g. Microsoft Internet Explorer version 3.0 and above, Netscape version 3.0 and above). It is built on the Java 1.0 JDK and is therefore compatible with browsers using JDK 1.0 and 1.1.

Depending on your system configuration, where you got HtmlSearch from, and your browser security settings, you may be restricted as to which sites on the WWW you can search with HtmlSearch.

Registration

Shareware is a program like any other program you can buy in a store, except that it is distributed on the honor system, i.e. you can try it, and if you like it, you buy it.
Besides allowing you to try things out before signing the big check, it also reduces your cost thanks to very low distribution overhead.
Even if you are using the Lite version, it is a good idea to register so we can inform you of bug fixes.
To register, please go to the MandoSoft site.

Other Search Methods

HtmlSearch needs to read each page to search it, which is very inefficient compared to CGI-based searches, not to mention the associated network load. In other words, if the site you are looking at provides a search function, you may be better off using it rather than HtmlSearch, especially since HtmlSearch may not be able to access all the pages of the site. On the other hand, HtmlSearch provides indexing while the site's CGI-based search may not; the site's own search may also restrict searches in ways that do not suit you (e.g. only some of the pages are covered by the search), or lack the flexibility that HtmlSearch offers.
For local searches (hard disk, CD-ROM), HtmlSearch is probably slower than tools provided by the operating system (grep on Unix, Tools->Search on Windows), but these tools do not follow links: they operate on files only.
There are also some other Java or JavaScript based search engines available from other sources, but obviously HtmlSearch is better :-).
