How to Run Screaming Frog in Google Compute Engine
There have been a few blog posts about using Screaming Frog to crawl extremely large sites. By controlling the depth of the crawl and your hardware resources, you can do quite a lot. For a year we crawled extremely large sites on a dedicated laptop with its RAM slots expanded to the limit, and we successfully did millions of pages this way. The most recent release of Screaming Frog has a database option which saves its crawl data on your local hard drive, effectively removing size limitations.
Except there are a lot of limitations. We could never get past two million pages crawled, due to a number of considerations, particularly time constraints. A million pages took about a week to crawl using the database storage method. Saving a crawl project took all day. Producing a 404 inlinks report would take hours. And after a lightning storm crashed the computer and destroyed a week of data, we figured there had to be another way.
Clearly, our biggest limitation was the hardware. As an advertising agency, we don’t have a lot of computational resources onsite. Website hosting doesn’t happen in the office, and most of what we do with our computers is link up with much better computers elsewhere.
If you think about it, there’s one company we know for sure has the hardware capacity to crawl scads and scads of pages. And as luck would have it, they also rent that hardware out by the minute.
Setting Up Google Compute Engine
There’s a lot of fun stuff in the Google Cloud Console, like BigQuery and APIs for AI services, but what we want is about halfway down the list: Compute Engine. I’m going to assume that you’ve already set up billing and IAM permissions in a way that makes sense for your organization. It may take you a little while to figure out the byzantine levels and classifications of permissions.
WARNING: MAKE SURE YOU HAVE BUDGET LIMITS AND WARNINGS IN PLACE. Because if you’re not careful, you can rack up thousands of dollars of expenses in here, but if you are careful, it’ll only be pocket change.
Compute Engine lets you create a virtual machine of any of the popular flavors of Linux (or a virtual Windows Server, if you’re depraved enough to use Windows), and you can scale the power to your needs and budget. The default VM instance has about the same power as a cheap laptop, and on the upper end they’re virtually supercomputers. Consequently the cost will be anywhere from $24 a month to $1649 a month.
So after you click on Compute Engine, click on create instance.
Give the instance a funky name like “where-the-frog-lives” or “scream-town.” If this is the only VM you put together, the naming scheme won’t matter, and if it’s the start of a profitable relationship with Google Cloud, it will ensure that future IT managers will hate your guts.
The “machine type” we’ve found that has the best balance between power and cost is the n1-highmem-8. It has enough memory to crawl essentially any site you can think of (52GB), and it’ll only cost about eleven bucks a day (if you remember to turn it off when you’re done with it). Don’t worry, you can scale up or down as necessary pretty much on the fly after you know what your needs are. Again, make sure you set budget limits and alerts in your account. In general, you can avoid most charges when you turn off your VM instance. Goodness help you if you forget about it and leave it running.
For “Boot Disk” change it to Ubuntu 14.04 LTS. This will give you an operating system much like the one you installed on that five-year-old laptop to squeeze a little more life out of it, except in this case the laptop will have eight cores and ten times the RAM.
You will probably want to set a static IP at this point, since you will need to have your IP whitelisted by the targeted server or your super-fast crawl will get you blocked pretty quick. Otherwise, every time you stop and turn on your VM, you will get a new external IP (which can also be useful if we’re honest).
Since you will want to stop this instance and come back to it from time to time, you will need a persistent disk. Start with 100GB, and increase it later if you need to. A little farther down we’ll talk about pushing large data files to Google Cloud Storage and from there to BigQuery. This will be the only way to deal with giant Screaming Frog reports.
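If you would rather script the console clicks above, the same choices can be sketched with the gcloud command line tool. This is a rough sketch rather than the setup we actually clicked through: the zone, region, address name and the ubuntu-1404-lts image family are my assumptions, and older image families get retired from the public list, so check what is currently offered before running it.

```shell
# Assumed values: swap in your own names, zone and region.
INSTANCE="where-the-frog-lives"
ZONE="us-central1-a"
REGION="us-central1"
ADDRESS_NAME="frog-static-ip"   # hypothetical name for the reserved IP

create_frog_vm() {
  # Reserve a static external IP so the address you get whitelisted never changes.
  gcloud compute addresses create "$ADDRESS_NAME" --region="$REGION"

  # 8 vCPUs and 52GB of RAM, Ubuntu boot image, 100GB persistent boot disk.
  gcloud compute instances create "$INSTANCE" \
    --zone="$ZONE" \
    --machine-type="n1-highmem-8" \
    --image-family="ubuntu-1404-lts" \
    --image-project="ubuntu-os-cloud" \
    --boot-disk-size="100GB" \
    --address="$ADDRESS_NAME"
}

# Nothing is created (or billed) until you call the function:
# create_frog_vm
```

Leave out the addresses line and the --address flag if you want a fresh external IP every time the VM restarts.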
A less polite solution to the whitelist problem is to have Screaming Frog use a proxy, which you can set at Configuration » System » Proxy. Alternatively, you can stop your VM instance every time you get blacklisted, and once you turn it back on you’ll have a new IP, provided you didn’t set a static one earlier.
Installing Screaming Frog in Compute Engine
It’s possible to run Screaming Frog from the command line, but it’s hard to find documentation on that. A much easier and more robust solution: give your Compute Engine instance a virtual desktop and use Screaming Frog like you would normally.
This article by Aditya Choudhary covers the step by step process for installing a GUI interface for your Compute Engine instance.
I won’t repeat these instructions (except for the firewall settings, which are out of date), but the general process is to install Gnome components and the VNC virtual desktop server. Then you need to open the ports which allow your VNC client to connect. I’ve found that TightVNC works pretty well as a VNC client.
The firewall for Compute Engine is a little tricky. By default everything is blocked. To get to the firewall settings, start from the cloud console list of virtual machine instances, then click on the name of the VM you’re using, which takes you to the “details” page. Then click on the “default” link from the Network Interfaces section.
Then click on “firewall rules” in the left-hand navigation. Then click “create firewall rule” and make the rule which will open up the VNC port for your personal IP. If they tried, they might be able to make that a little harder to find.
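The same rule can be sketched from the command line. The rule name, the placeholder IP and the port are my assumptions: VNC normally listens on 5900 plus the display number, so display :1 means port 5901, but check which display your VNC server actually started on.

```shell
MY_IP="203.0.113.7"                # placeholder address; use your own public IP
VNC_DISPLAY=1                      # assumes the VNC server runs on display :1
VNC_PORT=$((5900 + VNC_DISPLAY))   # VNC convention: 5900 + display number

open_vnc_port() {
  # Allow inbound VNC traffic on the default network from your IP only.
  gcloud compute firewall-rules create "allow-vnc-from-me" \
    --network="default" \
    --direction="INGRESS" \
    --allow="tcp:${VNC_PORT}" \
    --source-ranges="${MY_IP}/32"
}

# Run it once the values above are right:
# open_vnc_port
```

The /32 on the end keeps the rule down to a single address, which is what you want for a desktop that has no other protection in front of it.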
At this point you should have a TightVNC instance open, looking at a blank Ubuntu desktop. You should also have an SSH connection open, or if you’re a wiseguy, you can open a terminal window through VNC.
Compute Engine instances don’t have root passwords by default, so you need to create one with a command like:
sudo passwd
In your fancy GUI interface, open up Firefox (pre-loaded probably) and go to the Screaming Frog website. Download the Ubuntu installation file which should save as a *.deb. It’s easiest to install this from the command line with:
sudo dpkg -i *.deb
sudo apt-get install -f
After that installs, Screaming Frog will appear under the “internet” tab of the start menu. Presumably you already have a pro license key to enter into Screaming Frog, otherwise you will hit your crawl limit pretty quick.
Screaming Frog Deep Crawl Settings
Most of the configuration settings for Screaming Frog are features which limit its power and capability. But limits are for plebes. You now have world-class server power at your fingertips. Your settings will unleash the full power of your spider. Crawl beyond limits, my pretties!
Configuration » Spider » Basic: go ahead and check everything. Let’s be ambitious!
Configuration » Spider » Limits: uncheck everything.
Configuration » Speed » Max Threads: I’ve done 40 at once, which translates to around 3000 pages a minute. This will get you blocked by nearly everyone. Start with 10 at once and see how far that gets you.
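To get a feel for what the thread count means in crawl time, here is some back-of-the-envelope arithmetic. It assumes throughput scales in a straight line with threads, which is my simplification; the target server’s response time will bend that line in practice.

```shell
# 40 threads gave roughly 3000 pages a minute, so about 75 pages per thread.
PAGES_PER_MIN_AT_40=3000
PER_THREAD=$((PAGES_PER_MIN_AT_40 / 40))

# Estimated whole hours to crawl a number of pages at a given thread count.
# Shell arithmetic is integer-only, so results are rounded down.
crawl_hours() {
  pages=$1
  threads=$2
  echo $(( pages / (PER_THREAD * threads) / 60 ))
}

crawl_hours 1000000 10   # prints 22
crawl_hours 1000000 40   # prints 5
```

So a million pages is roughly a day at the polite setting and an afternoon at the impolite one, compared with the week it took us in database mode on the laptop.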
Configuration » System » Memory: subtract 2GB from what’s available on your machine type. In this case, enter 50GB.
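If you change machine types later, the same subtraction can be done on the VM itself. This is a small sketch with helper names I made up; it reads total RAM from /proc/meminfo, which usually reports a little less than the advertised figure.

```shell
# Total RAM in whole GB as the kernel reports it (Linux only).
ram_gb() {
  awk '/^MemTotal:/ { print int($2 / 1048576) }' /proc/meminfo
}

# Memory to hand to Screaming Frog: total minus 2GB for the OS and desktop.
frog_memory_gb() {
  echo $(( $1 - 2 ))
}

frog_memory_gb 52   # the advertised 52GB of an n1-highmem-8; prints 50

# On the VM itself:
# frog_memory_gb "$(ram_gb)"
```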
Configuration » System » Storage: set Mode to Memory Storage. Database mode is the fallback if you don’t have much RAM, but it runs much slower, and if you got the machine type I recommended, you should have more than enough memory.
Configuration » System » Proxy: use a disposable proxy if you think it’s likely you’ll get blocked. Getting blocked throws the whole crawl for a loop: all the URLs crash and you don’t get the same depth, so it’s better to avoid it by getting the server admins to whitelist you ahead of time.
Starting and Stopping the Instance
I’ve mentioned it a few times already, but Google is charging you by the minute, so you need to shut down the instance when you’re not actually crawling. The persistent disk and the static IP will also cost you money, but much less. If you click the stop button on the Cloud Dashboard, the VM will shut down just as if you gave the shutdown command from the terminal. The persistent disk will keep everything as you left it until you turn it back on. Just like a normal Ubuntu machine though, any apps you had open will need to be restarted. So be sure to save your Screaming Frog crawls before shutting down (that’s why we have the large persistent disk).
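The stop button has a command line equivalent, which is handy if you want to script the shutdown or check for a forgotten VM from your own machine. The instance name and zone are assumptions carried over from the naming suggestions earlier.

```shell
INSTANCE="where-the-frog-lives"
ZONE="us-central1-a"   # assumed zone

stop_frog()  { gcloud compute instances stop  "$INSTANCE" --zone="$ZONE"; }
start_frog() { gcloud compute instances start "$INSTANCE" --zone="$ZONE"; }

# Lists every instance still running, meaning still billing by the minute.
running_instances() {
  gcloud compute instances list \
    --filter="status=RUNNING" \
    --format="value(name,zone)"
}

# Typical end-of-day routine:
# stop_frog && running_instances
```

An empty list from running_instances is what you want to see before you go home.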
When you restart the VM, you will need to restart the virtual desktop with an SSH command like:
vncserver
When you restart Screaming Frog, you can also reload any crawls you haven’t finished. Loading a giant crawl takes forever on a normal computer, but it will only be a few minutes at most on Compute Engine.
Integrating with Cloud Storage and Bigquery
So you’ve crawled a mammoth enterprise-sized site, and now what are you going to do with all those reports? That’s where the other amazing Google Cloud products come in. You’re going to export the data in CSV format to Google Cloud Storage. From there you’re going to import into BigQuery, because there’s not a spreadsheet in the world that will open that much data.
Because Google has already thought of this, there’s an SDK that integrates neatly with other Cloud products. You initialize it with the SSH command:
gcloud init
Follow the prompts to authorize your Compute Engine using the web browser which you’re already logged in through. Now your storage buckets are only a few clicks away. Export your hard-won spider file to the bucket with a command like:
gsutil cp /home/*.spider gs://screaming-frog-bucket
It’s pretty much like having an attached drive to your virtual machine which is arbitrarily large and ridiculously cheap.
You will also want to export your various reports, like your 404 inlinks report or the insecure content report, to the bucket. Sure, you can work with that data through Screaming Frog in your little graphic interface window, but if you have any SQL experience, BigQuery will be much easier. Once you copy over your CSV reports to your storage bucket, it’s just a matter of importing into BigQuery.
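The copy and import can be sketched as one small function. The dataset, table and file names are hypothetical, and the CSV options are my guesses at what a Screaming Frog export needs: --autodetect lets BigQuery infer the columns, and you may have to raise --skip_leading_rows if your export puts a title line above the header row.

```shell
BUCKET="gs://screaming-frog-bucket"
DATASET="frog_crawls"            # hypothetical dataset name
TABLE="inlinks_404"              # hypothetical table name
REPORT="$HOME/404_inlinks.csv"   # hypothetical path to the exported report

load_report() {
  # Push the CSV to the bucket.
  gsutil cp "$REPORT" "$BUCKET/"

  # Create the dataset (this just errors harmlessly if it already exists).
  bq mk --dataset "$DATASET"

  # Load the CSV from the bucket into a table, inferring the schema.
  bq load \
    --source_format=CSV \
    --autodetect \
    --skip_leading_rows=1 \
    --allow_quoted_newlines \
    "${DATASET}.${TABLE}" \
    "$BUCKET/$(basename "$REPORT")"
}

# load_report
```

After that the report is a table you can query with ordinary SQL from the BigQuery console.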
If you’re going to leave the Compute Engine instance idle for a few weeks, you will want to cut back on persistent disk costs. Compute Engine only resizes disks upward, so getting down to something like 10GB means moving to a new, smaller disk rather than shrinking the one you have. Either way, transfer any large crawl files to Cloud Storage first and then hit the stop button. You’ll want to check in on billing from time to time to make sure you’re not getting charged for something you don’t need. In the meantime, your absurdly powerful site crawl platform is there ready and waiting.
Be careful with this tool: don’t overload and break anyone’s site. And let us know how it works out in the comments!