Explore some of the lesser-known robots.txt search directives

This week, I grabbed 667 964 robots exclusion files (/robots.txt) from the Alexa Top 1 Million domains. Here is what I learned about some of the lesser-known robots directives they contained.

Some quick and dirty data: 66.79 % of the domains in the top-one-million list have a robots file. 5.29 % of the domains returned HTML files (mostly HTTP 404 Not Found error pages), 1.55 % returned empty files, and 0.15 % of the domains returned other types of files.

The most common reason for faulty parsing was the inclusion of HTML comments with debug information. Most of these comments were unique, except for those from the WP Super Cache plugin for WordPress, which alone was responsible for putting comments in 0.08 % of all the robots files (a fix was provided to the project).

0.008 % of domains politely greet robots with some variation of the comment “# hi robot!”. A few even greet them in their own language: “# bleep bloop bleep”.

I’ll assume that anyone reading this is already familiar with the User-agent, Disallow, Allow, and Sitemap directives. Spelling them correctly is apparently harder: the top misspellings of the “Disallow” directive include 5001 domains spelling it “Disalow”, 765 “dissalow”, 381 “dissallow”, […], and 31 that just went with “sallow”.

With the basic directives and their spelling out of the way, let’s take a closer look at the most popular of the unusual directives I discovered.

Crawl-delay

Sets a delay between each new request to the website. For example, Crawl-delay: 12 tells a crawler to wait 12 seconds between each request, limiting it to no more than five page requests per minute.

This directive is recognized by Bing, Yandex (45 % market share in Russia and 20 % in Ukraine), Naver (40 % market share in South Korea), and Mail.Ru (5 % market share in Russia).

Due to the distributed nature of search crawlers, you may see more requests than expected, as it’s unclear whether the limit applies to the entire pool of crawlers or to each individual crawler. Bing specifies that the limit is applied to their entire crawler pool, but none of the other search engines provide any documentation on this.

User-Agent: *
Disallow:  # allow all
Crawl-delay: 6
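
To illustrate what honoring this looks like from the crawler’s side, here’s a minimal sketch using Python’s built-in robotparser module (which has understood Crawl-delay since Python 3.6); the domain and paths are placeholders:

import time
import urllib.request
from urllib.robotparser import RobotFileParser

robots = RobotFileParser("https://www.example.com/robots.txt")
robots.read()

# crawl_delay() returns the value set for the given user-agent, or None if unset.
delay = robots.crawl_delay("*") or 0

for path in ("/", "/articles/", "/about"):
    url = "https://www.example.com" + path
    if robots.can_fetch("*", url):
        urllib.request.urlopen(url)
    time.sleep(delay)  # pause between requests, as Crawl-delay asks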

Yandex supports fractional values, giving finer control than whole seconds. I’d still recommend sticking with integers, as none of the other crawlers advertise support for anything but whole seconds. Sticking with the lowest common denominator seems like the way to go to avoid problems.

40 % of the scanned robots files use a 10-second crawl delay, and the average crawl delay is 12.78 seconds. The average was calculated with delays capped at 30 seconds, the maximum recognized by most search engines.

This directive was seen on 78 516 domains. Fractional values were found on 1266 domains. Documentation is provided by Yandex, Naver (in Korean), and Mail.Ru (in Russian).

Host

Sets the canonical domain name for the server that served the robots file. The directive is recognized by Yandex and Mail.Ru.

User-Agent: *
Disallow:  # allow all
Host: www.example.com

This directive was seen on 42 408 domains. Documentation is provided by Yandex and Mail.Ru (in Russian).

2999 domains provided a URI rather than a hostname. Half of these use the HTTPS protocol. This use is entirely undocumented, but Yandex’s robots.txt tester doesn’t raise any errors over it.

Clean-param

Indicates query parameters (such as tracking parameters) that don’t affect page content and should be stripped from URLs. For example, “Clean-param: referral /” would transform “/document?referral=advert” into the canonical address “/document”.

This directive is used by Yandex and Mail.Ru. It’s essentially a different take on Google’s address canonicalization initiatives using link tags. The main difference from the canonical link tag is that it reduces crawler activity on your server, as address canonicalization happens on the crawler’s side: duplicate pages shouldn’t even have to be requested.

User-Agent: *
Disallow:  # allow all
Clean-param: utm_campaign /
Clean-param: referral /
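
As an illustration of what a supporting crawler might do with the rules above, here’s a minimal Python sketch that strips the named parameters from a URL (the function name is mine, and path-prefix matching is omitted for brevity):

from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameters named by the Clean-param lines above.
CLEAN_PARAMS = {"utm_campaign", "referral"}

def canonicalize(url):
    # Drop the listed query parameters; keep everything else as-is.
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in CLEAN_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(kept)))

print(canonicalize("https://www.example.com/document?referral=advert"))
# -> https://www.example.com/document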

This directive was seen on 2651 domains. Documentation is provided by Yandex and Mail.Ru (in Russian).

Extended Robots Exclusion “Standard”

The proposed Extended Robots Exclusion “Standard” was developed by Sean Conner in the late 1990s but hasn’t received much attention. Seznam controls some 20 % of the search market in Czechia, and their SeznamBot is the only known implementation in a major search engine.

The proposal includes the common User-agent, Disallow, and Allow directives, and extends them with these additional directives:

Request-rate

A variation of Crawl-delay that sets a request rate rather than a delay between each request. For example, Request-rate: 5/1m isn’t equivalent to Crawl-delay: 12, as all five requests could be made within the first few seconds of the minute. (Request-rate: 1/12s would be equivalent.)

User-Agent: *
Disallow:  # allow all
Request-rate: 10/1m
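
The difference from Crawl-delay is easiest to see in code. Here’s a minimal Python sketch of one possible interpretation of “Request-rate: 10/1m” as a rolling-window limit; the proposal doesn’t prescribe an algorithm, so treat the details as assumptions:

import time

# At most 10 requests per rolling 60-second window; no spacing is required
# inside the window, so all 10 may happen in the first few seconds.
MAX_REQUESTS, WINDOW = 10, 60.0
recent = []  # timestamps of requests made within the current window

def wait_for_slot():
    now = time.monotonic()
    # Forget requests that have fallen out of the window.
    while recent and now - recent[0] >= WINDOW:
        recent.pop(0)
    if len(recent) >= MAX_REQUESTS:
        # Sleep until the oldest tracked request leaves the window.
        time.sleep(WINDOW - (now - recent[0]))
        recent.pop(0)
    recent.append(time.monotonic())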

This directive was seen on 1315 domains. Documentation is provided by Seznam.

Visit-time

Sets the times of day during which crawlers should access the site. The standards proposal doesn’t clarify the intent, but this was probably meant to limit crawlers to accessing the site at night.

This doesn’t scale, as everyone would want robot activity at night to free up resources for humans during the day. The proposal doesn’t touch on time zones, but SeznamBot, as the only known adopter, specifies the time zone as UTC.

User-Agent: *
Disallow:  # allow all
Visit-time: 01:45-08:30
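
For the example above, a crawler’s check might look something like this minimal Python sketch, interpreting the window in UTC as SeznamBot documents (the function name is mine):

from datetime import datetime, time, timezone

# The "Visit-time: 01:45-08:30" window from the example above, in UTC.
START, END = time(1, 45), time(8, 30)

def within_visit_time():
    now = datetime.now(timezone.utc).time()
    return START <= now <= END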

This directive was seen on 864 domains.

Update: Seznam has removed all mentions of the Visit-time directive from their documentation, meaning there are now no known implementors.

If you’re using a Crawl-delay directive, it wouldn’t hurt to supply the same information through the Request-rate directive as well. However, specifying the Visit-time seems completely pointless.

Indexpage

360 Search (formerly HaoSou) receives about 20 % of the daily searches in China. Their 360Spider (sometimes HaoSouSpider) uses this unique robots directive as an indicator of frequently updated pages, such as sub-forum index pages, newspaper sections, and other pages that serve as indexes of new content.

360 Search indicates that they use these as hints to determine how often to fetch pages to discover new content. Like the Sitemap directive, the Indexpage directive must use an absolute URL rather than a relative URL.

User-Agent: *
Disallow:  # allow all
Indexpage: https://www.example.com/articles/archive/
Indexpage: https://www.example.com/forum/?order=newest
Indexpage: https://www.example.com/videos/*-category$

Unlike Baidu (60 % market share), 360 Search doesn’t offer machine translations from English to Chinese, so whether it’s worth adopting this directive depends on your desire to reach into the Chinese market and the type of content you publish. Including it at the bottom of your robots file shouldn’t hurt.

However, I’ve run an experiment on this website for over a month and have seen virtually no increase in traffic from 360 Search after adopting this directive. The number of pages indexed by their robot did, however, grow from ~30 % to 82 % of the pages on this domain in the same period. This could indicate that some level of preference is given to sites that include this directive.

This directive was seen on just 12 domains. Documentation is provided by 360 Search (in Chinese).

Now for some directives that don’t have any effect on anything, haven’t ever had any effect on anything, and likely will not ever have any effect on anything. These are just made-up directives with no known implementation that have seen some adoption for whatever reason.

Noindex

This supposedly “secret and undocumented” directive was cooked up by “SEO specialists” needing to fill their blogs with new and interesting content. It likely originated in a misunderstanding of the noindex meta tag.

It’s supposedly more powerful than the “Disallow” directive, as crawlers could still request pages and find links to other pages without including the page itself in search results. No known crawlers use this directive and it’s entirely pointless, but hey – easy SEO points!

This directive was seen on 26 423 domains. I’ll not cite any sources here because I don’t want them to get any more traffic and attention for this dubious “innovation”.

Automated Content Access Protocol

ACAP-crawler, ACAP-disallow-crawl, and ACAP-allow-crawl are exactly equivalent to User-agent, Disallow, and Allow. They serve the same purpose but carry a copyright holders’ interest group’s acronym as a prefix and “-crawl[er]” as a suffix. Innovative. There are no known implementors; why would there be any?

These directives were seen on 245 domains. Documentation is provided by IPTC.

I believe it would be worthwhile to adopt any of these directives that you find interesting (except for Noindex, of course). There aren’t many crawlers that support these lesser-known directives, but search engines will adopt new directives as they start to see them in active use.

It could be worthwhile to explore the different options and see whether your site can break into new markets by increasing its search presence around the world. It doesn’t hurt to adopt new directives: lines in robots files containing directives that a crawler doesn’t understand will simply be ignored (assuming the crawler follows the robots.txt specification).
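
This is a consequence of how robots files are parsed. A minimal Python sketch of a tolerant parser, assuming the usual line-oriented “field: value” format, shows why unrecognized directives are harmless:

# Only fields the crawler cares about are kept; everything else is skipped.
KNOWN = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}

def parse(robots_txt):
    rules = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, _, value = line.partition(":")
        if field.strip().lower() in KNOWN:
            rules.append((field.strip().lower(), value.strip()))
        # Unknown fields (Host, Clean-param, Indexpage, …) are silently ignored.
    return rules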

Placing the lesser-known directives towards the end of your robots file could help ensure that any parsing error happens after the more popular directives have already been processed.

You can see an example of many of these directives used in combination by looking at ctrl.blog/robots.txt. They may not be too widely supported, but there’s no real harm in specifying more directives.