AdSense hates documentation, or has at least chosen not to document how you can exclude articles from appearing in the AdSense Matched content sections. Luckily, there’s in fact a system for it and I’ll even tell you how to use it.
Matched content is an article recommendation system from AdSense that tries to suggest relevant and interesting articles to your website’s visitors. It’s currently only available to publishers with many pages and high levels of traffic. You’ve probably seen many of these content recommendation systems all around the web. The one thing they all have in common is that they most often make bad recommendations.
Not all AdSense publishers are eligible to use Matched content. Many publishers are interested in the system, however, as it has all of Google’s knowledge and personalization experience behind its recommendations.
Sometimes, however, it chooses to feature something you’d rather not have put in front of too many people. These may be articles that aren’t all that flattering to your business, they may be outdated or otherwise no longer relevant, or maybe they’re just boring legal pages. Whatever the reason, it’s sometimes desirable to exclude certain pages from becoming recommended reading for your visitors.
You can exclude a page using the standard /robots.txt file, but this will also exclude your page from search engines, which maybe isn’t what you want.
If you’re running Matched content units from AdSense, a.k.a. “Recommended by Google”, you may have noticed some new bot activity from Google. Roughly once per day, a robot identifying itself as “Mediapartners-Google” and originating from an IP address belonging to Google requests the following three files:
In keeping with the finest AdSense traditions, the formats of these files aren’t documented anywhere.
These files are placed at the top level (root) of your web server. Their purpose is made clear from their names, however, and their format can be determined with some trial and error. But please do keep in mind that as the following information isn’t documented by AdSense, AdSense may change it or stop supporting it at any time.
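If you want to see what the bot asks for on your own server, your access log is the place to look. The following is a minimal sketch, assuming your server writes the common “combined” log format. The function name, the regular expression, and the sample log lines are all made up for illustration. Note that it only looks at the user agent string, which anyone can fake, so it doesn’t verify that the request really came from a Google IP address.

```python
import re
from collections import Counter

# Assumption: the server writes the common "combined" access log format.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def matched_content_requests(log_lines, agent_token="Mediapartners-Google"):
    """Count the paths requested by clients whose user agent contains agent_token."""
    counts = Counter()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if match and agent_token in match.group("agent"):
            counts[match.group("path")] += 1
    return counts


# Made-up sample lines, for illustration only.
sample_log = [
    '192.0.2.1 - - [01/Jan/2020:00:00:01 +0000] '
    '"GET /google_matched_content_rules.xml HTTP/1.1" 404 152 "-" "Mediapartners-Google"',
    '192.0.2.7 - - [01/Jan/2020:00:00:05 +0000] '
    '"GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]
counts = matched_content_requests(sample_log)
```

In practice you would pass in the lines of your real log file, for example an open file object, instead of the sample list.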
The two text files were easy enough to work out: List one URL per line that you wish to exclude in the blocklist, and list exceptions to broader exclusion rules in the allow-list. You can make a broad matching rule using wildcard expressions with the * character. For example, https://www.example.com/documents/ doesn’t exclude other pages underneath /documents/ unless you add a wildcard at the end like this: https://www.example.com/documents/*.
If you’ve excluded a directory using wildcards, you can still allow certain pages to appear in content recommendations by including them in the allow-list file. The longest URL (in number of characters) in either your blocklist or allow-list seems to take precedence. There’s no point in listing URLs that aren’t blocked in the allow-list, as allowing is the default policy anyway.
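To make the precedence rule concrete, here is a rough sketch of how I imagine the two lists being evaluated together. This is my own model of the observed behavior, not AdSense’s code. The function names are invented, the tie-breaking rule is an assumption (I haven’t seen what happens when both lists match with rules of equal length), and I let the wildcard match anywhere in the rule although I’ve only seen it used at the end.

```python
import re


def _pattern_to_regex(pattern):
    # Treat "*" as "any run of characters"; everything else matches literally.
    return re.compile(".*".join(re.escape(part) for part in pattern.split("*")))


def _longest_match(url, patterns):
    """Length of the longest pattern that matches the whole URL, or -1 if none does."""
    lengths = [len(p) for p in patterns if _pattern_to_regex(p).fullmatch(url)]
    return max(lengths, default=-1)


def is_excluded(url, blocklist, allowlist):
    """Guess whether a URL would be kept out of the recommendations."""
    blocked = _longest_match(url, blocklist)
    allowed = _longest_match(url, allowlist)
    # Assumption: on a tie the block rule wins; this case is unverified.
    return blocked >= 0 and blocked >= allowed


blocklist = ["https://www.example.com/documents/*"]
allowlist = ["https://www.example.com/documents/press-kit.html"]
```

With these lists, is_excluded() returns True for https://www.example.com/documents/old.html but False for the press kit page, because the allow-list rule is the longer of the two matching rules.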
The /google_matched_content_rules.xml file was a bit trickier to work out. In the end I set up a crawler to retrieve this file from the Alexa Top 1 Million websites, hoping that some publishers would have this file available. Six out of a million websites had this file, but that was enough for me to work out the format and options.
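The crawler itself doesn’t need to be anything fancy. The following is a simplified sketch of the idea rather than the exact code I ran. It assumes every site answers over HTTPS on its bare domain name, and the function names are just for illustration.

```python
import http.client
import urllib.request

RULES_PATH = "/google_matched_content_rules.xml"


def rules_url(domain):
    """Build the rules-file URL for a bare domain name such as "example.com"."""
    # Assumption: the site is reachable over HTTPS on the bare domain.
    return "https://" + domain.strip().lower().rstrip("/") + RULES_PATH


def fetch_rules(domain, timeout=10):
    """Return the body of the rules file as bytes, or None if it can't be fetched."""
    try:
        with urllib.request.urlopen(rules_url(domain), timeout=timeout) as response:
            return response.read()
    except (OSError, http.client.HTTPException):
        # Covers HTTP errors such as 404, DNS failures, timeouts and broken responses.
        return None
```

Calling fetch_rules() for each domain in the list and keeping the non-empty results gets you most of the way. Many websites answer with a regular HTML page instead of a proper 404 error, so you’d also want to check that what comes back actually looks like XML.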
Here is a simple sample rules file:
So far, I’ve only shown the same capabilities in the rules file as you’ve got with the plain text blocklist and allow-list files. However, I’ve also identified some more powerful features, though I’m not entirely sure how they’re supposed to work. The sample size was rather small and I’ve had more trouble verifying these, although I do think my guesses as to their application are pretty spot on. Some documentation from AdSense would be of great help here! I’ll provide a more complex example, and then move on to speculate about how it all hangs together afterward.
As I’ve already mentioned, I can only speculate on the exact purposes of the section, source, target, and rss elements. However, my educated guesses as to their purposes are as follows:
A section’s source identifies a partial URL that your readers visit on your website. While a reader is in this section, as identified by its source, Matched content should preferably recommend content that matches the URLs of the identified target addresses. This seems correct given the format, but I’ve been unable to confirm that this is how it works. Recommendations still seem to be at AdSense’s discretion.
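To spell out my guess, here is a small model of it in Python. Please treat this purely as an illustration of the speculation above. The Section class, the function name, and the example paths are all hypothetical, and I’m assuming that a “partial URL” means a simple substring match.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    source: str                                   # partial URL the reader is currently on
    targets: list = field(default_factory=list)   # partial URLs to prefer in recommendations


def preferred_targets(current_url, sections):
    """Collect the targets of every section whose source occurs in current_url."""
    result = []
    for section in sections:
        # Assumption: "partial URL" is a plain substring match.
        if section.source in current_url:
            result.extend(section.targets)
    return result


# Hypothetical sections for a site that covers both recipes and travel.
sections = [
    Section(source="/recipes/", targets=["/recipes/", "/kitchen-gear/"]),
    Section(source="/travel/", targets=["/travel/"]),
]
```

Under this model, a reader on https://www.example.com/recipes/pancakes would preferably be shown articles from /recipes/ and /kitchen-gear/, while a page matching no section would get the regular recommendations.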
This can be a powerful feature for many publishers who publish frequently and on diverse topics.
The last element, rss, seems to specify the URL of a syndication feed. I presume this feed is fetched more often than once per day like the main rules files, and I further guess that links in the feed may be prioritized above other links when they’d provide relevant recommendations. I haven’t been able to confirm this either, but I can’t see any other use this tag would have.
Google AdSense doesn’t have any contact options, so I was unable to ask for any input for this article.
I’ll update this post later as more information from AdSense becomes available.