Let’s whet your appetite for our book SEO for WordPress in advance of the release date. This is a truly awesome excerpt because it talks about robots. The robots.txt file that is. You can buy the book at Amazon.
The Ultimate WordPress Robots.txt File
We learned in Chapter 2 that WordPress generates archive, tag, comment, and category pages that raise duplicate content issues. We can signal to search engines to ignore these duplicate content pages with a robots.txt file. In this section, we’ll kill a few birds with one ultimate robots.txt file. We’ll tell search engines to ignore our duplicated pages. We’ll go further: we’ll instruct search engines not to index our admin area and not to index non-essential folders on our server. As an option, we can also ask bad bots not to index any pages on our site, although bad bots tend to do as they wish.
You can create a robots.txt file in any text editor. Place the file in the root directory/folder of your website (not the WordPress template folder) and the search engines will find it automatically.
The following robots.txt is quite simple, but can accomplish much in a few lines:
User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
Disallow: /trackback
Disallow: /comments
Disallow: /category/*/*
Disallow: /tag/
Disallow: */trackback
Disallow: */comments
Line one, “User-agent: *,” means that this robots.txt file applies to any and all spiders and bots. The next twelve lines all begin with “Disallow.” The Disallow directive simply means “don’t index this location.” The first Disallow directive tells spiders not to index our /cgi-bin folder or its contents. The next five Disallow directives tell spiders to stay out of our WordPress system folders, including the admin area. The last six Disallow directives cure the duplicate content generated through trackbacks, comments, category pages, and tag pages.
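If you want to confirm how these rules behave before deploying them, you can run them through Python’s standard-library robots.txt parser. A caveat: the stdlib parser implements the original literal-prefix matching rules and does not expand “*” wildcards inside paths, so this sketch only exercises the literal-prefix directives; the sample paths are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The rules from the text above, pasted in as a string rather than
# fetched from a live site.
rules = """\
User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
Disallow: /trackback
Disallow: /comments
Disallow: /category/*/*
Disallow: /tag/
Disallow: */trackback
Disallow: */comments
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Ordinary posts stay crawlable; admin and tag pages do not.
print(parser.can_fetch("*", "/my-great-post/"))     # True
print(parser.can_fetch("*", "/wp-admin/edit.php"))  # False
print(parser.can_fetch("*", "/tag/seo/"))           # False
```

To test against a real site, call `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()` instead of `parse()`.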
We can also disable indexing of historical archive pages by adding a few more lines, one for each year of archives.
Disallow: /2006/
Disallow: /2007/
Disallow: /2008/
Disallow: /2009/
Disallow: /2010/
Disallow: /2011/
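Since these year lines follow a fixed pattern, a short script can generate them for whatever range your archives cover. The year range here is just the one used in the example above.

```python
# Generate one Disallow line per archive year, so the list
# never falls out of date. The range is illustrative.
first_year, last_year = 2006, 2011

lines = [f"Disallow: /{year}/" for year in range(first_year, last_year + 1)]
print("\n".join(lines))
```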
We can also direct email-harvesting programs, link-exchange schemes, worthless search engines, and other undesirable website visitors not to index our site:
User-agent: SiteSnagger
Disallow: /

User-agent: WebStripper
Disallow: /
These lines instruct the named bots not to index any pages on your site. You can create new entries if you know the name of the user agent that you wish to disallow. SiteSnagger and WebStripper are both services that crawl and copy entire websites so that their users can view them offline. These bots are very unpopular with webmasters because they crawl thoroughly, aggressively, and without pausing, increasing the burden on web servers and diminishing performance for legitimate users.
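The same standard-library parser can confirm that these per-bot entries block only the named agents, while crawlers with no matching entry (Googlebot is used here purely as an example of a well-behaved bot) remain unaffected:

```python
from urllib.robotparser import RobotFileParser

# The bad-bot entries from the text above.
rules = """\
User-agent: SiteSnagger
Disallow: /

User-agent: WebStripper
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The named bots are shut out entirely; others see no restrictions.
print(parser.can_fetch("SiteSnagger", "/any-page/"))  # False
print(parser.can_fetch("WebStripper", "/any-page/"))  # False
print(parser.can_fetch("Googlebot", "/any-page/"))    # True
```

Of course, this only verifies the file’s logic; as noted above, whether a bad bot actually obeys it is another matter.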
Check out Wikipedia’s robots.txt file for an example of a complex, educational, and entertaining use of the tool. Dozens of bad bots are restricted by the file, with some illustrative commentary.