How to Fix WordPress Robots.txt after the “Googlebot Cannot Access JS and CSS” Warning

We’ve been doing SEO for WordPress for a long time. A big part of that has always been controlling the quantity and quality of indexed pages, since WordPress automatically creates so many different flavors of content. If you’ve read Michael David’s book on WordPress SEO, you’ve seen his ultimate robots.txt file (https://tastyplacement.com/book-excerpt-the-ultimate-wordpress-robots-txt-file), which goes something like this:
User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
Disallow: /trackback
Disallow: /comments
Disallow: /category/*/*
Disallow: /tag/
Disallow: */trackback
Disallow: */comments

Unfortunately, we’re in a post-Mobilegeddon world. Google is expecting free access to render every page in its entirety so it can infer the sort of experience a user would have on various mobile devices. A few weeks ago, a significant portion of the WordPress installations in the world received the Google Search Console warning:

Googlebot cannot access CSS and JS files

Some of you may be wondering why we can’t just remove all the robots.txt disallow rules, stop being fussy about what’s allowed and disallowed, and let Googlebot decide what it thinks is important. The answer is security: you don’t want a deep index of your site to be publicly searchable. For instance, the following search term returns thousands of WordPress installations that expose the highly hackable timthumb.php:

intitle:index timthumb.php

Just something to think about when you assume that Google has your site’s best interests at heart.

You could go through each blocked resource and allow the precise file paths line by line, but that’s going to be very time consuming.
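Under the longest-rule-wins precedence discussed further down, a full file path is more specific than the directory disallow it sits under, so precise Allow lines inside the existing User-agent: * group do work. Here’s a sketch of what that looks like, with your-theme, some-plugin, and the file names standing in as placeholders for whatever Search Console’s blocked-resources report actually lists on your install:

User-agent: *
Disallow: /wp-content/plugins
Disallow: /wp-content/themes
Allow: /wp-content/themes/your-theme/style.css
Allow: /wp-content/themes/your-theme/js/navigation.js
Allow: /wp-content/plugins/some-plugin/assets/frontend.js
# ...one line per blocked asset, kept in sync every time a theme or plugin updates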

The solution that has been going around (advocated by the likes of SEOroundtable and Peter Mahoney) is to add a few lines that explicitly allow Google’s spiders access to the resources in question:

user-agent: googlebot
Allow: .js
Allow: .css
#THE ABOVE CODE IS WRONG!

Yes, this unblocks the JavaScript and CSS resources; you can see it working in the Search Console fetch and render tool. Unfortunately, it also gives Googlebot access to the entire site.

If you haven’t read the Google Developers page on robots.txt, I highly recommend doing so. It’s like 50 Shades of Grey for nerds. The section under “Order of Precedence for User-Agents” states “Only one group of group-member records is valid for a particular crawler . . . the most specific user-agent that still matches. All other groups of records are ignored by the crawler.” By creating a new group for Googlebot, you are effectively erasing all prior disallow rules.
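To make that concrete, here is roughly what the combined file looks like once the circulating fix is pasted in (the disallow list is abbreviated from the file above), with comments marking which group Googlebot actually obeys:

User-agent: *
Disallow: /wp-admin
Disallow: /wp-content/plugins
Disallow: /wp-content/themes
# ...ignored by Googlebot, which matches the more specific group below

User-agent: Googlebot
Allow: .js
Allow: .css
# the only group Googlebot obeys, and it disallows nothing at all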
[Screenshot: Google Search Console robots.txt tester showing the resources as Allowed]
You can try putting the allow directives inside the existing group, but that won’t work either, because of the order of precedence for group-member records. The longest (most specific) rule wins, so the following rules would leave the JavaScript resources blocked:
user-agent: googlebot
disallow: /wp-content/
allow: .js
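# a path like /wp-content/themes/foo/script.js matches both rules, but /wp-content/ (12 characters) beats .js (3 characters) on length, so the disallow wins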

[Screenshot: Google Search Console robots.txt tester showing the resource as Blocked]

And the precedence of conflicting rules with wildcards is undefined, so it’s a toss-up for:
user-agent: googlebot
disallow: /wp-content/themes/
allow: /wp-content/themes/*.js
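# both rules match a theme's .js files, and because the allow relies on a wildcard, the spec doesn't say which one takes precedence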

The long and the short of it is that there is no simple cut-and-paste solution to this issue. We’re approaching it on a case-by-case basis, doing what’s necessary for each WordPress installation.

As for keeping the indexes clean, we’re going to lean heavily on robots meta tags, as managed by our (still) favorite SEO plugin. Expect the role of robots.txt to be greatly reduced going forward.
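If that holds, the robots.txt itself can probably shrink back to the handful of paths that have nothing to do with rendering a page. Purely as a sketch of where we expect to land, not something to paste anywhere:

User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin

Everything more granular than that, the category, tag, and trackback cleanup, moves into page-level meta tags.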
