Keeping robots, spiders and wanderers away from your site using robots.txt, meta tags and other methods

   See Also
 Add a search engine to your site
 Find out who's got a link to your home page
 How to search for a newsgroup
 Let them search engine robots know what your page is really about using DESCRIPTION and KEYWORDS meta tags
 robots.txt file related security issues

Trying to keep those search engine spiders, wanderers and other cataloging robots away from your top secret web pages? Here are some measures you can take:

robots.txt

The "Robots Exclusion Protocol", the protocol designed to help web administrators and authors of web spiders agree on a way to navigate and catalog sites, require that you place a plain text file named "robots.txt" containing spidering rules, in the root directory of a site. It is important to note that this file must reside in the root directory of the main site, not in any other directory. For example, if your site is www.chami.com, the file must be accessible from http://www.chami.com/robots.txt

The content of the robots.txt file consists mostly of two commands: "User-agent" and "Disallow".

The "User-agent:" command should specify the name or the signature of the robot which the spidering commands following it should be applied to. You can set this to * to instruct that the spidering commands should be applied to any robot that has not been identified in any other place inside the robots.txt file.

The other command, "Disallow:", specifies a partial URL that the previously identified web robot should ignore (not index). The value is a prefix match: for example, "Disallow: /docs" blocks both /docs/index.html and /docs.html. If you leave the value empty, the specified robot is free to navigate any and all pages in your site.

Let's take a look at some example robots.txt files:

• Tell all robots to go away (do not index any page in this site):
 
User-agent: *
Disallow: /
Listing #1 : Text code. Right click robots1.txt to download.

 
• Tell "WebCrawler" robot, for example, to leave this site alone. All other robots are welcome:
 
User-agent: WebCrawler
Disallow: /
Listing #2 : Text code. Right click robots2.txt to download.

 
• All robots should stay away from /~mydir/. Other directories are not restricted:
 
User-agent: *
Disallow: /~mydir/
Listing #3 : Text code. Right click robots3.txt to download.

 
• WebCrawler can access all directories except /~mydir/. All other robots may access all directories except /docs/, /private/ and /cgi-bin/ (note the blank line separating the two records):
 
User-agent: *
Disallow: /docs/
Disallow: /private/
Disallow: /cgi-bin/

User-agent: WebCrawler
Disallow: /~mydir/
Listing #4 : Text code. Right click robots4.txt to download.

    NOTE: Since the Robots Exclusion Protocol is not acknowledged by all web robot authors, it is not possible to stop every robot from wandering your site. The good news, however, is that the majority of the well-known search engines and tools support this protocol. Refer to their documentation to verify this.
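For the curious, here is roughly how a well-behaved robot interprets these rules. Python's standard robotparser module implements the protocol; the sketch below feeds it the rules from Listing #4 and asks which URLs each robot may fetch (the robot names and sample URLs are just examples):

# Check some URLs against the rules from Listing #4 using
# Python's standard robots.txt parser.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /docs/
Disallow: /private/
Disallow: /cgi-bin/

User-agent: WebCrawler
Disallow: /~mydir/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())
# A live robot would instead do:
#   rp.set_url("http://www.chami.com/robots.txt"); rp.read()

# WebCrawler matches its own record, so only /~mydir/ is off limits:
print(rp.can_fetch("WebCrawler", "/~mydir/secret.html"))  # False
print(rp.can_fetch("WebCrawler", "/docs/manual.html"))    # True

# Every other robot falls back to the "*" record:
print(rp.can_fetch("Scooter", "/docs/manual.html"))       # False
print(rp.can_fetch("Scooter", "/~mydir/secret.html"))     # True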


ROBOTS : NOINDEX, NOFOLLOW

One of the major disadvantages of the robots.txt file is that you must be able to place it in the root web directory. If you don't have such access, or if your web space provider can't give you a hand with this, you'll need a different way to stop robots. This is mainly why the "ROBOTS" META tag was created. Unfortunately, even fewer robots look for this META tag at the moment.

ROBOTS META tag in action:

• Tell all robots to go away (do not index any pages in this site):
 
<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">
Listing #5 : HTML code. Right click meta1.htm to download.

 
• Allow indexing of the current page, however ask not to follow links inside the page for further cataloging:
 
<META NAME="ROBOTS" CONTENT="INDEX,NOFOLLOW">
Listing #6 : HTML code. Right click meta2.htm to download.

 
• Disallow indexing of the current page, yet allow following links inside the page:
 
<META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW">
Listing #7 : HTML code. Right click meta3.htm to download.

 
• Allow indexing and following links inside the page. This is the default action for robots, so it's not necessary to use this tag:
 
<META NAME="ROBOTS" CONTENT="INDEX,FOLLOW">
Listing #8 : HTML code. Right click meta4.htm to download.

 

Note that all META tags should be placed inside the HEAD section of your HTML document. For example:

<HTML>
<HEAD>
  <META NAME="ROBOTS" CONTENT="INDEX,FOLLOW">
  <!-- more meta and other tags -->
</HEAD>
<BODY>
  <!-- document body -->
</BODY>
</HTML>
Listing #9 : HTML code. Right click metasamp.htm to download.
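And here, again for the curious, is roughly what a robot does on its end: scan each fetched page for a ROBOTS META tag before deciding whether to index the page or follow its links. A minimal sketch using Python's standard html.parser module (the class name and sample page are just illustrations):

# Sketch of how a robot might honor the ROBOTS META tag:
# parse a page and report its index/follow directives.
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.index = True   # INDEX,FOLLOW is the default
        self.follow = True

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us
        if tag != "meta":
            return
        attrs = dict(attrs)
        if (attrs.get("name") or "").upper() != "ROBOTS":
            return
        directives = [d.strip().upper()
                      for d in (attrs.get("content") or "").split(",")]
        if "NOINDEX" in directives:
            self.index = False
        if "NOFOLLOW" in directives:
            self.follow = False

page = '<HTML><HEAD><META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW"></HEAD></HTML>'
parser = RobotsMetaParser()
parser.feed(page)
print("index:", parser.index, "follow:", parser.follow)
# prints: index: False follow: True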


Conclusion

Still unsure how to proceed?

    TIP:
  1. Use the robots.txt file if possible
     
  2. Use the ROBOTS META tag if you can't create the above file. It's okay to use both methods at once.
     
  3. If you know which robots you're trying to prevent from indexing your pages, a particular search engine for example, go straight to the source: many search engines provide ways for you to remove your URLs from their indexes without having to use any of the above methods.
     
  4. Make the page stand-alone if possible; that is, remove links pointing to the page you're trying to keep away from robots. The more links there are to a page, the easier it is for a search engine robot to find it. If the page is already in search engine indexes, it's too late for this preventative step.
     
  5. If you must have absolute protection from robots, password protect the pages in question (see the sample server configuration after this list). Since all the other methods are "agreements" that both parties must acknowledge in order to work in full, preventing the page from being served is the only way to guarantee that robots will not be able to touch your pages.
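For example, on an Apache web server (an assumption; other servers have equivalent features, so consult your provider) you could password protect a directory with an .htaccess file. The realm name and file paths below are placeholders:

# .htaccess -- place in the directory you want to protect
AuthType Basic
AuthName "Private area"
AuthUserFile /full/path/to/.htpasswd
Require valid-user

The accompanying password file is created with Apache's htpasswd utility, for example: htpasswd -c /full/path/to/.htpasswd myname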

Agents Abroad
"A list of bots; robots, spiders, agents, crawlers, and other automated intelligent agents."
 
BotSpot
A resource for all things Intelligent Agent and bot related, including Bot of the Week.
 
Robot Exclusion Standard Revisited
"A document intended to highlight some issues involving the current standard for robot exclusion, as well as to propose some suggestions for future expansion of the standard." -- June 2, 1996
 
The Web Robots Pages
The standard, FAQ, list of active robots, mailing list and other related sites.
Links Listing #1 : Web robots related resources

 
Created on 1-Jan-1998. Updated on 26-Jul-1998.
Copyright (C) 1996-99 Chami.com All Rights Reserved. Reproduction
in whole or in part or in any form or medium without express written
permission of Chami.com is prohibited. Information on this page is
provided as-is without warranty of any kind. Use at your own risk.