Book Excerpt: The Ultimate WordPress Robots.txt File
Let’s whet your appetite for our book SEO for WordPress in advance of the release date. This is a truly awesome excerpt because it talks about robots. The robots.txt file, that is. You can buy the book at Amazon.
The Ultimate WordPress Robots.txt File
We learned in Chapter 2 that WordPress generates archive, tag, comment, and category pages that raise duplicate content issues. We can signal to search engines to ignore these duplicate content pages with a robots.txt file. In this section, we’ll kill a few birds with one ultimate robots.txt file. We’ll tell search engines to ignore our duplicated pages. We’ll go further: we’ll instruct search engines not to index our admin area or other non-essential folders on our server. As an option, we can also ask bad bots not to index any pages on our site, although bad bots tend to do as they wish.
You can create a robots.txt file in any text editor. Place the file in the root directory/folder of your website (not the WordPress template folder) and the search engines will find it automatically.
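As a quick illustration, here is a typical layout on a shared host, assuming the web root is named public_html and using the placeholder domain www.yoursite.com (your host may name the root folder differently):

/public_html/robots.txt (where you upload the file)
http://www.yoursite.com/robots.txt (the URL where crawlers will look for it)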
The following robots.txt is quite simple, but can accomplish much in a few lines:
User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
Disallow: /trackback
Disallow: /comments
Disallow: /category/*/*
Disallow: /tag/
Disallow: */trackback
Disallow: */comments
Line one, “User-agent: *,” means that this robots.txt file applies to any and all spiders and bots. The next twelve lines all begin with “Disallow.” The Disallow directive simply means “don’t index this location.” The first Disallow directive tells spiders not to index our /cgi-bin folder or its contents. The next five Disallow directives tell spiders to stay out of our WordPress admin area and system folders. The last six Disallow directives cure the duplicate content generated by trackback, comment, category, and tag pages.
We can also disable indexing of historical archive pages by adding a few more lines, one for each year of archives.
Disallow: /2006/
Disallow: /2007/
Disallow: /2008/
Disallow: /2009/
Disallow: /2010/
Disallow: /2011/
We can also direct email harvesting programs, link exchange schemes, worthless search engines, and other undesirable website visitors not to index our site:
User-agent: SiteSnagger
Disallow: /

User-agent: WebStripper
Disallow: /
These lines instruct the named bots not to index any pages on your site. You can create new entries if you know the name of the user agent that you wish to disallow. SiteSnagger and WebStripper are both services that crawl and copy entire websites so that their users can view them offline. These bots are very unpopular with webmasters because they crawl thoroughly, aggressively, and without pausing, increasing the burden on web servers and diminishing performance for legitimate users.
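Putting the pieces from this section together, the assembled file might look something like the sketch below. The archive years and the two bad-bot entries are simply the examples used above; treat them as placeholders and adjust them to match your own site:

User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
Disallow: /trackback
Disallow: /comments
Disallow: /category/*/*
Disallow: /tag/
Disallow: */trackback
Disallow: */comments
Disallow: /2010/
Disallow: /2011/

User-agent: SiteSnagger
Disallow: /

User-agent: WebStripper
Disallow: /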
Tip:
Check out Wikipedia’s robots.txt file for an example of a complex, educational, and entertaining use of the tool. Dozens of bad bots are restricted by the file, with some illustrative commentary.
Thank you for this. I needed to exclude over 600 posts by year date and you just helped me do that!
I’ve made plenty of websites with WordPress, and have read a bit about WordPress SEO. I’ve tinkered with the robots.txt file, of course, when I’ve needed to. However, I’ve yet to create a robots.txt file as complex as this. I can’t believe I haven’t been doing this, and that I haven’t, until now, looked at Wikipedia’s robots.txt. That file, with the comments in it, is truly a work of art! Why is a more complex robots.txt not more mainstream in the WordPress industry? I don’t believe that the main SEO plugins are in-depth enough to quickly set up a robots.txt like this.
Thanks for the post, I’m sure I’ll integrate a more in-depth robots.txt file in my WordPress SEO strategy from now on.
Thanks for the helpful info. I have just made up a robots.txt file from your suggestions, and I also have to say thanks to Garrett, because the Wikipedia robots.txt has also been a great help, particularly in blocking a number of bad bots.
A simple way could be
Disallow: /wp-*
You, sir, are a genius… very cool tip…
Since WordPress sitemap plug-ins frequently do not name their sitemap “sitemap.xml”, it’s also important to place a Sitemap directive in your robots.txt. Example:
Sitemap: http://www.MySite.com/MySiteMap.xml
Great information, thanks. I also wanted to edit my robots.txt file.
Hey, your way of blocking archive pages is totally wrong. By writing “Disallow: /2010/” you will de-index all your 2010 posts from Google. I tried it and it did the same to me.
Please answer.
Oh, this is important. If you have your URLs set to http://www.yoursite.com/2010/this_is_a_blog_post, then it is true that Disallow: /2010/ will remove these posts from the index. But why put your posts under such a year-based URL scheme? We recommend http://www.yoursite.com/this_is_a_blog_post
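For anyone who wants to move to that structure, here is a minimal sketch of the relevant WordPress setting, assuming a standard install (the exact menu wording can vary a little between versions):

Settings → Permalinks → Custom Structure: /%postname%/

This builds post URLs from the post name alone rather than nesting them under the year, so a year-based Disallow rule will not touch them.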
Thank you so much! I immediately ordered your Kindle book and am reading it now, after finding this page. Sure wish I’d thought of this a decade ago. I am trying to redesign my site in WordPress, using a similar but neater format than it has now. Ten minutes into the book and I am learning, and being reminded of, things I should have done in the last fifteen years. It is a must-have for any web designer. Thank you so much for your wonderful advice :-)