HtmlSearch Documentation: Overview


Why use HtmlSearch

HtmlSearch allows you to search through a web site that is either local (e.g. on your hard disk, a CD-ROM or a LAN) or remote (e.g. on the World Wide Web).
Many web sites have a search facility, but it generally relies on a CGI script running on the site's server.
If a site is hosted on a server that does not support CGI scripts, or is stored on local media, that kind of search is not available. HtmlSearch lets you search without the assistance of CGI scripts. It is therefore useful for:
  • Sites hosted on a server that does not support CGI scripts
  • Sites that do not offer a search facility
  • Sites local to a hard disk or CD-ROM
  • Verifying that all the links on a site are reachable

    Main features

  • works on any platform, operating system, or Java-enabled browser
  • search a site without the use of server side indexing or CGI
  • start the search from any page or local file
  • AND and OR boolean operators
  • search for individual words or phrases
  • allow the search to explore hosts other than the starting host
  • ability to search HTML files or any type of file (text file, word processing document, etc...)
  • ability to search through HTML tags
  • refine searches by searching through the previous search results
  • ability to eliminate hosts, file types, paths, large files, directory hierarchies from search
  • shows broken links, inaccessible pages, unavailable hosts
  • ability to save search results as a text list or an HTML index
  • ability to index an entire site - use custom dictionary for word selection
  • configure which functionality and screens you want your users to be able to see and access

    New in Release 2.1

  • corrected bug by which search would pause for a long time on certain pages
  • corrected a bug which forced the applet to start as 'lite' after using the help, even if the pro version was loaded
  • better handling for malformed HTML syntax (empty links, repeating HREFs, etc...)
  • many changes to the documentation
  • better memory management (allocation, cleanup) allows for longer searches
  • print the search elapsed time at the end of the search
  • print the applet info in the java console
  • smaller window size, leaving more room for the pages found display
  • colored start/stop button (not in Netscape 3)
  • rewording of the domain selection options (formerly: new search/found pages/previous found pages)
  • add 'downLevels', identical to 'dirLevels' but for downstream directories
  • 'downLevels' and 'dirLevels' restrict search in directories in the same path as the starting URL
  • better display of the found string so that it ends on a space
  • easier to stop search and stop index generation
  • using HtmlSearch as an application: CTRL-F4 kills the application
  • HtmlSearch is delivered as a JAR archive for faster loading
  • index: always uses the found pages
  • ability to generate a CSV file
  • changes in the GUI and button placement and label
  • choice to separate the words if several words were specified in the 'look for'
  • corrected bug that prevented selecting the file to save to (this is still subject to Applets security restrictions, which are particular to the browser and your settings)
  • the dictionary filename can be specified relative to the codeBase (the directory parent to HtmlSearchApp), or as an absolute URL.
    If the file does not exist, the default dictionary (HtmlSearchApp/indexExclusion.txt) is used.
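The dictionary-location rule in the release notes above (a name relative to the codeBase, or an absolute URL) can be pictured with the short Java sketch below. It is only an illustration under stated assumptions, not HtmlSearch's own code: the class `DictionaryLocationSketch`, the method `resolve` and the example.com addresses are hypothetical. It also does not check whether the file exists, so the fallback to the default dictionary is shown only for a missing name.

```java
import java.net.URI;

// Hypothetical illustration; not taken from the HtmlSearch source.
class DictionaryLocationSketch {
    // Default dictionary path as given in the documentation, relative to the codeBase.
    static final String DEFAULT_DICTIONARY = "HtmlSearchApp/indexExclusion.txt";

    /**
     * Resolves a dictionary name against the codeBase.
     * An absolute URL is returned unchanged; a relative name is appended
     * to the codeBase directory. A missing name falls back to the default.
     * (Assumption: no existence check is made here, since that needs I/O.)
     */
    static URI resolve(URI codeBase, String dictionary) {
        if (dictionary == null || dictionary.trim().isEmpty()) {
            return codeBase.resolve(DEFAULT_DICTIONARY);
        }
        return codeBase.resolve(dictionary.trim());
    }
}
```

With a codeBase of `http://www.example.com/search/`, the name `words.txt` would resolve to `http://www.example.com/search/words.txt`, while an absolute URL would be used as given.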

    Usage

    There are 2 versions of HtmlSearch:

    HtmlSearch can be used in one of 3 modes:

    There is a how-to file that explains how to set up HtmlSearch.

     

    How it works

    HtmlSearch functions very much like your browser does when you click on a link, except that it does so automatically. In other words, it reads a page, looks at the links it contains, follows those links, and so on.
    As such, it has the same access capabilities and limitations as your browser, in addition to restrictions due to the particular nature of Java applets.
    In addition it keeps the pages it examines in its own cache, so that from one search to the next it doesn't re-read pages that have already been read. (This applies only as long as you do not exit the current search session.)
    The links followed are those found in hyperlinks (e.g. "A HREF", image maps, etc.). HtmlSearch does not follow links to, or generated by, CGI scripts, Java, or JavaScript calls.
    If a link points to a directory or to an unreachable URL, HtmlSearch will try to access the files 'index.html' and 'index.htm' in that directory.
    Searches can be slow if the pages searched are located on slow or busy servers, or if your modem is slow. The speed of the search also depends on your PC, and the settings in the Advanced panel.
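The read-a-page, collect-its-links, follow-them loop described above can be sketched in Java as follows. This is a minimal illustration, not HtmlSearch's actual implementation: the class `LinkFollowerSketch` and its methods are hypothetical, an in-memory map stands in for the network, page names are used as-is (relative URL resolution is left out), and the text match is a plain case-sensitive substring test.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not taken from the HtmlSearch source.
class LinkFollowerSketch {
    // Matches href="..." or href='...' in any letter case.
    // Real-world HTML would need a more tolerant parser.
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    /** Returns the link targets found in one page, in document order. */
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String target = m.group(1).trim();
            if (!target.isEmpty()) {          // ignore empty links
                links.add(target);
            }
        }
        return links;
    }

    /**
     * Visits pages breadth-first, starting at 'start', and returns the
     * names of the pages whose raw HTML contains 'text'.
     * 'pages' maps a page name to its HTML and stands in for the network.
     */
    static List<String> search(Map<String, String> pages, String start, String text) {
        List<String> found = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty()) {
            String page = queue.remove();
            if (!visited.add(page)) {
                continue;                     // already read once
            }
            String html = pages.get(page);
            if (html == null) {
                continue;                     // broken link: nothing to read
            }
            if (html.contains(text)) {
                found.add(page);
            }
            queue.addAll(extractLinks(html)); // follow the links later
        }
        return found;
    }
}
```

The visited set is what keeps the loop from reading the same page twice when pages link back to each other.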

    Starting and Stopping the Search: as stated above, HtmlSearch keeps all the visited pages in its cache. This means that if you stop a search and then restart it, even from a different URL or with a different search string, HtmlSearch will first use its cache before going back to the network to read a page. This minimizes search time and network load, and makes searches very fast the "second time", e.g. when searching for different strings on the same set of pages. The cache is window-specific: there is one cache per window in which HtmlSearch is loaded, and several open HtmlSearch windows do not share their caches. All searches also use the browser's own caching mechanism, which may further reduce access time and network load.
    The caching mechanism can have a negative impact if you run many searches in the same window, since every page visited is kept in the cache: if you search through thousands of pages, the memory requirements may exceed the browser's capacity (this depends on the platform, the browser and the browser settings). It may therefore be wise to 'kill' the search every now and then and restart it in a new window.
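The cache behaviour described above, including why memory grows with every page visited, can be illustrated with a small Java sketch. It is not HtmlSearch's real cache: the class `PageCacheSketch` and its methods are hypothetical, and the `fetcher` function is an assumed stand-in for a network read.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration; not taken from the HtmlSearch source.
class PageCacheSketch {
    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> fetcher; // stands in for a network read
    private int fetchCount = 0;

    PageCacheSketch(Function<String, String> fetcher) {
        this.fetcher = fetcher;
    }

    /** Returns the page from the cache, going to the network only on a miss. */
    String read(String url) {
        String html = cache.get(url);
        if (html == null) {
            html = fetcher.apply(url);
            fetchCount++;
            if (html != null) {
                cache.put(url, html);   // kept until the cache is cleared
            }
        }
        return html;
    }

    /** Number of times the network stand-in was actually called. */
    int fetchCount() {
        return fetchCount;
    }

    /** Number of pages currently held; this only grows until clear(). */
    int size() {
        return cache.size();
    }

    /** Comparable to restarting in a new window: everything is forgotten. */
    void clear() {
        cache.clear();
    }
}
```

A second search over the same pages triggers no new fetches, which is the "second time" speed-up. The cache never evicts anything on its own, which is the memory cost the paragraph above warns about.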

     

    Requirements

    HtmlSearch is a Java program, and you need a Java-enabled browser (e.g. Microsoft Internet Explorer version 3.0 and above, Netscape version 3.0 and above). It is built on the Java 1.0 JDK and is therefore compatible with browsers using JDK 1.0 and 1.1.

    Depending on your system configuration, where you got HtmlSearch from, and your browser security settings, you may be restricted as to which sites on the WWW you can search with HtmlSearch.

     

    Registration

    Shareware is a program like any other you can buy in a store, except that it is distributed on the honor system: you can try it, and if you like it, you buy it.
    Besides allowing you to try things out before signing the big check, it also reduces your cost thanks to very low distribution overhead.
    Even if you are using the Lite version, it is a good idea to register so we can inform you of bug fixes.
    To register, please go to the MandoSoft site.

     

    Other Search Methods

    HtmlSearch needs to read each page in order to search it, which is very inefficient compared to CGI-based searches, not to mention the associated network load. In other words, if the site you are looking at provides a search function, you may be better off using it rather than HtmlSearch, especially since HtmlSearch may not be able to access all the pages of the site. On the other hand, HtmlSearch provides indexing while the site's CGI-based search may not; the site's own search may also restrict searches in ways that do not suit you (e.g. only some of the pages are covered), or lack the flexibility that HtmlSearch offers.
    For local searches (hard disk, CD-ROM), HtmlSearch is probably slower than operating-system tools (grep on Unix, Tools->Search on Windows), but these tools do not follow links: they operate on files only.
    There are also some other Java or JavaScript based search engines available from other sources, but obviously HtmlSearch is better :-).

