Every webmaster wants traffic to their website. The best way to get it is to have a search engine crawl your site, index your pages, and make them available in its search results.
The king of search is, of course, Google. Getting indexed by Google isn’t hard, and once their crawler, Googlebot, starts hitting your pages then you’re on your way. But what happens when Googlebot gets a bit overzealous and starts indexing too much at once?
I’ll tell you what happens. Resource usage goes through the roof, pages slow down, and in some cases, your site visitors will start getting error messages instead of pages. It’s also possible that cats and dogs may start living together, but that may only happen if Googlebot crawls Bill Murray’s website.
How to find out if Googlebot is indexing your website
There are a couple of ways to find out if Googlebot is crawling your website. One is to use a statistics program like AWStats, Webalizer, or Analog Stats.
Statistics applications can give you some insight into what is hitting your website, but they don’t usually give you the kind of detail you need to find out if Googlebot is running you over with crawls. Instead, they tend to focus more on the total hits in a wide timeframe, such as a whole month.
That won’t help if you’re trying to find out if Googlebot is crawling your site too much this week, yesterday, or right this moment. So how can you find out if Googlebot is crawling your website during a particular timeframe?
Check the logs through cPanel
If you use cPanel, you can use the Visitors log to find out who has visited your site recently. It will show the most recent 1000 visitors, which page they visited and when. Best of all, it’s searchable. Just enter “Googlebot” in the search field and you’ll get a list of records where Googlebot hit your website pages within the last 1000 visitors.
To get a more specific look at when Googlebot is crawling your website, you’ll need to look at your domlogs directly. cPanel provides a link to your logs via the Raw Access option. The logs are in plain text format, but unless you have an application to parse the data into an easier-to-consume format, they will be mostly unusable.
After all, you may be looking at tens of thousands of records, or much more if your website is popular.
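If you’ve downloaded a raw log (and decompressed it, if needed), a few standard command-line tools can do that parsing for you. Here’s a minimal sketch, assuming the log uses the Apache combined format, where the requested path is the seventh whitespace-separated field. The function name googlebot_top_paths is made up for this example.

```shell
# Print the 10 paths Googlebot requested most often, with hit counts.
# Assumes Apache combined log format (requested path is field 7).
# Usage: googlebot_top_paths /path/to/access_log
googlebot_top_paths() {
  grep 'Googlebot' "$1" \
    | awk '{ print $7 }' \
    | sort | uniq -c | sort -rn | head -n 10
}
```

Call it with the path to your log file. The first column is the number of hits and the second is the path that was requested.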
Use SSH to find Googlebot
If you have access to SSH, you can run a grep search of your logs to find Googlebot. If you are using a cPanel server, you can use the following command to get a count of how many times Googlebot has visited your site in a specific timeframe.
sudo grep 'Googlebot' /usr/local/apache/domlogs/yourdomain.com | grep '23/Nov/2017:03:4' | wc -l
In this example, we’re searching the domlogs of yourdomain.com. We’re checking for any instance where Googlebot shows up on November 23rd, 2017 between 3:40:00 am and 3:49:59 am. The result will be the number of log lines, and therefore requests, that contain the string “Googlebot” during that timeframe.
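If a single ten-minute window doesn’t tell you enough, the same idea stretches to an hour-by-hour breakdown. The sketch below assumes the Apache combined log format, where the fourth field looks like [23/Nov/2017:03:41:12. The function name googlebot_hits_per_hour is made up for this example.

```shell
# Count Googlebot requests per hour in an Apache combined-format log.
# Field 4 looks like "[23/Nov/2017:03:41:12"; characters 2-15 are the
# date plus the hour, e.g. "23/Nov/2017:03".
# Usage: googlebot_hits_per_hour /path/to/access_log
googlebot_hits_per_hour() {
  grep 'Googlebot' "$1" \
    | awk '{ print substr($4, 2, 14) }' \
    | sort | uniq -c
}
```

Each output line is a count followed by the date and hour it belongs to. A sudden jump from one hour to the next is a good sign that Googlebot is the one hammering your site. You’ll need to run it as a user who can read the log file.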
Now that you know Googlebot is beating up your site, how can you stop it from happening?
Slowing Googlebot’s crawl rate
You may have heard of a file called robots.txt. If you’re not familiar with it, check out this page to learn about what a robots.txt file is and how it can help protect your website from the fury of web crawlers.
Most legitimate web crawlers obey the robots.txt file. Unfortunately, Googlebot is not one of them.
Actually, that’s not entirely true. Googlebot will stay out of directories if you disallow them in your robots.txt file. What it won’t do is obey the Crawl-delay directive, which is how you set a crawl rate for other crawlers. Fortunately, Google provides its own method of slowing Googlebot.
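To make that concrete, here’s a rough sketch of a robots.txt that uses both kinds of rule, wrapped in a small shell function that writes it to whatever path you give it. The /private/ directory, the 10-second delay, and the function name write_sample_robots_txt are all made up for illustration. Adjust them to your own site before using anything like this.

```shell
# Write a sample robots.txt to the path given as the first argument.
# The directory and delay below are hypothetical examples.
write_sample_robots_txt() {
  cat > "$1" <<'EOF'
# Googlebot honors Disallow, so it will skip this directory.
User-agent: Googlebot
Disallow: /private/

# Some other crawlers honor Crawl-delay; Googlebot ignores it.
User-agent: *
Crawl-delay: 10
EOF
}
```

Crawlers look for the file at the root of your domain, so the finished file needs to end up in your site’s document root as robots.txt.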
Google’s Search Console (formerly called Webmaster Tools) is a service from Google that provides webmasters with a set of tools to help optimize and index their websites. It also provides a throttle for Googlebot.
To slow down Googlebot using the Google Search Console, follow these simple steps:
- Sign into Google Search Console.
- If you haven’t added your site yet, add your site. Not sure how? We have a guide for that. If your site has been added, click on your site.
- Click the gear icon and select Site Settings.
- Under Site Settings, set the Crawl Rate option to Limit Google’s maximum crawl rate. You will then get an option to modify the rate at which Googlebot crawls your website.
That’s it! You’ve now successfully modified the Googlebot Crawl Rate.
As webmasters, we welcome search engine spiders. They help drive a lot of traffic to our websites by scanning our content and indexing our pages. Unfortunately, some of them can get a bit overly excited and beat your site through the floor.
For Google, limiting the Googlebot crawl rate is done easily through the Google Search Console. Check it out. It has a lot of great features.