Let’s whet your appetite for our book SEO for WordPress in advance of the release date. This is a truly awesome excerpt because it talks about robots. The robots.txt file that is. You can buy the book at Amazon.
The Ultimate WordPress Robots.txt File
We learned in Chapter 2 that WordPress generates archive, tag, comment, and category pages that raise duplicate content issues. We can signal to search engines to ignore these duplicate content pages with a robots.txt file. In this section, we’ll kill a few birds with one ultimate robots.txt file. We’ll tell search engines to ignore our duplicated pages. We’ll go further: we’ll instruct search engines not to index our admin area and not to index non-essential folders on our server. As an option, we can also ask bad bots not to index any pages on our site, although bad bots tend to do as they wish.
You can create a robots.txt file in any text editor. Place the file in the root directory/folder of your website (not the WordPress template folder) and the search engines will find it automatically.
The following robots.txt is quite simple, but can accomplish much in a few lines:
User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
Disallow: /trackback
Disallow: /comments
Disallow: /category/*/*
Disallow: /tag/
Disallow: */trackback
Disallow: */comments
Line one, “User-agent: *,” means that this robots.txt file applies to any and all spiders and bots. The next twelve lines all begin with “Disallow.” The Disallow directive simply means “don’t index this location.” The first Disallow directive tells spiders not to index our /cgi-bin folder or its contents. The next five Disallow directives tell spiders to stay out of our WordPress system folders, including the admin area. The last six Disallow directives cure the duplicate content generated through trackbacks, comments, category pages, and tag pages.
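If you want to confirm how these rules behave before deploying them, you can run them through Python’s standard-library robots.txt parser. A caveat: the stdlib parser implements the original literal-prefix matching rules and does not expand “*” wildcards inside paths, so this sketch only exercises the literal-prefix directives; the sample paths are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The rules from the text above, pasted in as a string rather than
# fetched from a live site.
rules = """\
User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
Disallow: /trackback
Disallow: /comments
Disallow: /category/*/*
Disallow: /tag/
Disallow: */trackback
Disallow: */comments
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Ordinary posts stay crawlable; admin and tag pages do not.
print(parser.can_fetch("*", "/my-great-post/"))     # True
print(parser.can_fetch("*", "/wp-admin/edit.php"))  # False
print(parser.can_fetch("*", "/tag/seo/"))           # False
```

To test against a real site, call `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()` instead of `parse()`.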
We can also disable indexing of historical archive pages by adding a few more lines, one for each year of archives.
Disallow: /2006/
Disallow: /2007/
Disallow: /2008/
Disallow: /2009/
Disallow: /2010/
Disallow: /2011/
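Since these year lines follow a fixed pattern, a short script can generate them for whatever range your archives cover. The year range here is just the one used in the example above.

```python
# Generate one Disallow line per archive year, so the list
# never falls out of date. The range is illustrative.
first_year, last_year = 2006, 2011

lines = [f"Disallow: /{year}/" for year in range(first_year, last_year + 1)]
print("\n".join(lines))
```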
We can also direct email-harvesting programs, link-exchange schemes, worthless search engines, and other undesirable website visitors not to index our site:
User-agent: SiteSnagger
Disallow: /

User-agent: WebStripper
Disallow: /
These lines instruct the named bots not to index any pages on your site. You can create new entries if you know the name of the user agent that you wish to disallow. SiteSnagger and WebStripper are both services that crawl and copy entire websites so that their users can view them offline. These bots are very unpopular with webmasters because they crawl thoroughly, aggressively, and without pausing, increasing the burden on web servers and diminishing performance for legitimate users.
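The same standard-library parser can confirm that these per-bot entries block only the named agents, while crawlers with no matching entry (Googlebot is used here purely as an example of a well-behaved bot) remain unaffected:

```python
from urllib.robotparser import RobotFileParser

# The bad-bot entries from the text above.
rules = """\
User-agent: SiteSnagger
Disallow: /

User-agent: WebStripper
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The named bots are shut out entirely; others see no restrictions.
print(parser.can_fetch("SiteSnagger", "/any-page/"))  # False
print(parser.can_fetch("WebStripper", "/any-page/"))  # False
print(parser.can_fetch("Googlebot", "/any-page/"))    # True
```

Of course, this only verifies the file’s logic; as noted above, whether a bad bot actually obeys it is another matter.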
Check out Wikipedia’s robots.txt file for an example of a complex, educational, and entertaining use of the tool. Dozens of bad bots are restricted by the file, with some illustrative commentary.